Assistive listening device and human-computer interface using short-time target cancellation for improved speech intelligibility

ABSTRACT

An assistive listening device includes a set of microphones including an array arranged into pairs about a nominal listening axis with respective distinct intra-pair microphone spacings, and a pair of ear-worn loudspeakers. Audio circuitry performs arrayed-microphone short-time target cancellation processing including (1) applying short-time frequency transforms to convert time-domain audio input signals into frequency-domain signals for every short-time analysis frame, (2) calculating ratio masks from the frequency-domain signals of respective microphone pairs, wherein the calculation of a ratio mask includes both a frequency domain subtraction of signal values of a microphone pair and a scaling of a resulting frequency domain noise estimate by a pre-computed phase difference normalization vector, (3) calculating a global ratio mask from the plurality of ratio masks, and (4) applying the global ratio mask, and inverse short-time frequency transforms, to selected ones of the frequency-domain signals, thereby generating audio output signals for driving the loudspeakers. The circuitry and processing may also be realized in a machine hearing device executing a human-computer interface application.

RELATED APPLICATION

This application is a Continuation-in-Part (CIP) of U.S. application Ser. No. 16/514,669, filed on Jul. 17, 2019, which is a continuation of PCT Application No. PCT/US2019/0420046, filed Jul. 16, 2019, which claims the benefit of U.S. Provisional Patent Application No. 62/699,176, filed on Jul. 17, 2018, each of which is incorporated herein by reference in its entirety.

STATEMENT OF U.S. GOVERNMENT RIGHTS

The invention was made with U.S. Government support under National Institutes of Health (NIH) grant no. DC000100. The U.S. Government has certain rights in the invention.

TECHNICAL FIELD

The invention described herein relates to systems employing audio signal processing to improve speech intelligibility, including for example assistive listening devices (hearing aids) and computerized speech recognition applications (human-computer interfaces).

BACKGROUND

Several circumstances and situations exist where it is challenging to hear voices and conversations of other people. As one example, in crowded areas or large crowds, it can often be challenging for most individuals to carry on a conversation with select people. The background noise can be extreme, making it virtually impossible to hear the comments/conversation of individual people. In another situation, those with hearing ailments can struggle with hearing in general, especially when trying to separate the comments/conversation of one individual from others in the area. This can even be a problem in relatively small groups. In these situations, hearing assistance devices provide an invaluable resource.

Speech recognition is also a continual challenge for automated systems. Although great strides have been made, allowing automated voice recognition to be implemented in several devices and/or systems, further advances are possible. Generally, these automated systems still have difficulty identifying a specific voice when other conversations are happening. This situation often occurs where an automated system is being used in open areas (e.g., office complexes, coffee shops, etc.).

The “cocktail party problem” presents a challenge for both established and experimental approaches from different fields of inquiry. There is the problem itself, isolating a target talker in a mixture of talkers, but there is also the question of whether a solution can be arrived at in real time, without context-dependent training beforehand, and without a priori knowledge of the number, and locations, of the competing talkers. This has proved to be an especially challenging problem given the extremely short time-scale in which a solution must be arrived at. In order to be usable in an assistive listening device (i.e., hearing aid), any processing would have to solve this sound source segregation problem within only a few milliseconds (ms), and must arrive at a new solution somewhere in the range of every 5 to 20 ms, given that the spectrotemporal content of the challenging listening environment changes rapidly over time.

The hard problem here is not the static noise sources (think of the constant hum of a refrigerator); the real challenge is competing talkers, as speech has spectrotemporal variations that established approaches have difficulty suppressing. Stationary noise has a spectrum that does not change over time, whereas interfering speech, with its spectrotemporal fluctuations, is an example of non-stationary noise.

There are various established methods that are effective for suppressing stationary noise. However, these established methods do not provide an intelligibility benefit in non-stationary noise (i.e., interfering talkers). What is needed to solve this problem is a time-varying filter capable of computing a new set of frequency channel filter weights every few milliseconds, so as to suppress the rapid spectrotemporal fluctuations of non-stationary noise (i.e., interfering talkers). Various attempts to address these problems have been made; however, many are not able to operate efficiently, or in real time. Consequently, the challenge of suppressing non-stationary noise from interfering sound sources still exists.

SUMMARY

What is needed to solve the above-mentioned problem is a time-varying filter capable of computing a new set of frequency channel weights every few milliseconds, so as to suppress the rapid spectrotemporal fluctuations of non-stationary noise. The devices described herein compute a time-varying filter, with causal and memoryless “frame by frame” short-time processing that is designed to run in real time, without any a priori knowledge of the interfering sound sources, and without any training. The devices described herein enhance speech intelligibility in the presence of both stationary and non-stationary noise (i.e., interfering talkers).

The devices described herein leverage the computational efficiency of the Fast Fourier Transform (FFT). Hence, they are physically and practically realizable as devices that can operate in real time, with reasonable and usable battery life, and without reliance on significant computational resources. The processing is designed to use short-time analysis windows in the range of 5 to 20 ms; for every analysis frame, frequency-domain signals are computed from time-domain signals, a vector of frequency channel weights is computed and applied in the frequency domain, and the filtered frequency-domain signals are converted back into time-domain signals.

In one variation, an Assistive Listening Device (ALD) employs an array (e.g., 6) of forward-facing microphones whose outputs are processed by Short-Time Target Cancellation (STTC) to compute a Time-Frequency (T-F) mask (i.e., time-varying filter) used to attenuate non-target sound sources in Left and Right near-ear microphones. The device can enhance speech intelligibility for a target talker from a designated look direction while preserving binaural cues that are important for spatial hearing.

In another application, STTC processing is implemented as a computer-integrated front-end for machine hearing applications such as Automatic Speech Recognition (ASR) and teleconferencing. More generally, the STTC front-end approach may be used for Human-Computer Interaction (HCI) in environments with multiple competing talkers, such as restaurants, customer service centers, and air-traffic control towers. Variations could be integrated into use-environment structures such as the dashboard of a car or the cockpit of an airplane.

More particularly, in one aspect an assistive listening device is disclosed that includes a set of microphones generating respective audio input signals and including an array of the microphones being arranged into pairs about a nominal listening axis with respective distinct intra-pair microphone spacings, and a pair of ear-worn loudspeakers. Audio circuitry is configured and operative to perform arrayed-microphone short-time target cancellation processing including (1) applying short-time frequency transforms to convert the audio input signals into respective frequency-domain signals for every short-time analysis frame, (2) calculating respective pair-wise ratio masks and binary masks from the frequency-domain signals of respective microphone pairs of the array, wherein the calculation of a ratio mask includes a frequency domain subtraction of signal values of a microphone pair, (3) calculating a global ratio mask from the pair-wise ratio masks and a global binary mask from the pair-wise binary masks, (4) calculating a thresholded ratio mask, an effective time-varying filter with a vector of frequency channel weights for every short-time analysis frame, from the global ratio mask and global binary mask, and (5) applying the thresholded ratio mask, and inverse short-time frequency transforms, to selected ones of the frequency-domain signals to generate audio output signals for driving the loudspeakers. Although the preferred processing involves using the thresholded ratio mask to produce the output, an effective assistive listening device that enhances speech intelligibility could be built using only the global ratio mask.

In another aspect, a machine hearing device is disclosed that includes processing circuitry configured and operative to execute a machine hearing application to identify semantic content of a speech signal supplied thereto and to perform an automated action in response to the identified semantic content, and a set of microphones generating respective audio input signals and including an array of the microphones arranged into pairs about a nominal listening axis with respective distinct intra-pair microphone spacings. Audio circuitry is configured and operative to perform arrayed-microphone short-time target cancellation processing including (1) applying short-time frequency transforms to convert the audio input signals into respective frequency-domain signals for every short-time analysis frame, (2) calculating respective pair-wise ratio masks and binary masks from the frequency-domain signals of respective microphone pairs of the array, wherein the calculation of a ratio mask includes a frequency domain subtraction of signal values of a microphone pair, (3) calculating a global ratio mask from the pair-wise ratio masks and a global binary mask from the pair-wise binary masks, (4) calculating a thresholded ratio mask, an effective time-varying filter with a vector of frequency channel weights for every short-time analysis frame, from the global ratio mask and global binary mask, and (5) applying the thresholded ratio mask and inverse short-time frequency transforms to selected ones of the frequency-domain signals to generate audio output signals for driving the loudspeakers. Although the preferred processing involves using the thresholded ratio mask to produce the output, an effective machine hearing device could be built using only the global ratio mask.

There are existing methods, including adaptive beamformers such as the Multichannel Wiener Filter (MWF) and Minimum Variance Distortionless Response (MVDR) beamformers, that use past values (i.e., memory) to compute a filter that can attenuate stationary sound sources; these methods are appropriate for attenuating the buzz of a refrigerator or the hum of an engine, which are stationary sound sources that do not have unpredictable spectrotemporal fluctuations. The approach described herein uses Short-Time Target Cancellation (STTC) processing to compute a time-varying filter, in the form of a vector of frequency channel weights for every analysis frame, using only the data from the current short-time analysis frame. As such, it is causal and memoryless, is capable of running in real time, and can be used to attenuate both stationary and non-stationary sound sources.

The approach and devices described herein can attenuate interfering talkers (i.e., non-stationary sound sources) using real-time processing. Another advantage of the approach described herein, relative to adaptive beamformers such as the MWF and MVDR, is that the time-varying filter computed by the STTC processing is a set of frequency channel weights that can be applied independently to signals at the Left and Right ear, thereby enhancing speech intelligibility for a target talker while still preserving binaural cues for spatial hearing.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features and advantages will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views.

FIG. 1 is a general block diagram of a system employing STTC processing for improving speech intelligibility for a target talker;

FIG. 2 is a general block diagram of STTC processing;

FIG. 3 is a block diagram of audio circuitry of an assistive listening device (ALD);

FIG. 4 is a depiction of a specialized eyeglass frame incorporating components of an ALD;

FIG. 5 is a plot of phase separations for microphone pairs of an ALD;

FIG. 6 is a block diagram of STTC processing for an ALD;

FIG. 7 is a plot for a ramped threshold used in STTC processing;

FIG. 8 is a depiction of a specialized eyeglass frame incorporating components of an ALD according to an alternative arrangement;

FIG. 9 is a demonstration figure with example Time-Frequency (T-F) masks for a mixture of three concurrent talkers;

FIG. 10 is an illustration of causal and memoryless “frame by frame” processing;

FIG. 11 is a block diagram of alternative STTC processing for an ALD;

FIG. 12 is a depiction of a second example embodiment of an ALD;

FIG. 13 is a second plot of phase separations, for microphone pairs of an ALD such as that of FIG. 12;

FIG. 14 is a block diagram of the alternative STTC processing used in this second example embodiment of an ALD, such as that of FIG. 12;

FIG. 15 is a depiction of a third example embodiment of an ALD;

FIG. 16 is a block diagram of the alternative STTC processing used in a third example embodiment of an ALD, such as that of FIG. 15;

FIG. 17 (same as FIG. 9 in the original specification) is a block diagram of circuitry of a computerized device incorporating STTC processing for human-computer interface (HCI);

FIG. 18 (same as FIG. 10 in the original specification) is a depiction of a specialized computer incorporating microphone pairs for STTC processing;

FIG. 19 (same as FIG. 11 in the original specification) is a plot of phase separations for microphone pairs of a specialized computer such as that of FIG. 18 (i.e., FIG. 10 in the original specification);

FIG. 20 (same as FIG. 12 in the original specification) is a block diagram of STTC processing for a computerized device such as that of FIG. 18 (i.e., FIG. 10 in the original specification);

FIG. 21 (same as FIG. 13 in the original specification) is a plot for an alternative ramped threshold used in STTC processing;

FIG. 22 is a block diagram of alternative STTC processing for a computerized device such as that of FIG. 18 (i.e., FIG. 10 in the original specification);

FIG. 23 is a demonstration figure with example Time-Frequency (T-F) masks for a mixture of three concurrent talkers;

FIG. 24 (same as FIG. 14 in the original specification) is a block diagram of STTC processing for a binaural hearing aid;

FIG. 25 is a block diagram of alternative STTC processing for a binaural hearing aid;

FIG. 26 is a block diagram of alternative STTC processing for a “dual monaural” binaural hearing aid;

FIG. 27 is a depiction of a binaural hearing aid (i.e., ALD) incorporating two pairs of microphones for STTC processing;

FIG. 28 is a third plot of phase separations, for microphone pairs of a binaural hearing aid such as that of FIG. 27;

FIG. 29 is a block diagram of alternative STTC processing for a binaural hearing aid such as that of FIG. 27;

FIG. 30 is a block diagram of alternative STTC processing for a binaural hearing aid such as that of FIG. 27.

DESCRIPTION OF EMBODIMENTS

FIG. 1 shows an audio system in generalized form, including microphones [10] having outputs coupled to audio circuitry [12]. In operation, the microphones [10] respond to acoustic input of an immediate environment that includes a target talker [13-T] and one or more non-target talkers [13-NT], generating respective audio signals [14]. These are supplied to the audio circuitry [12], which applies short-time target cancellation (STTC) processing to enhance the intelligibility of the target talker [13-T] in the presence of the interfering non-target talkers [13-NT]. Details of the STTC processing are provided herein.

The general arrangement of FIG. 1 may be realized in a variety of more specific ways, two of which are described in some detail. In one realization, the arrangement is incorporated into an assistive listening device (ALD) or “hearing aid”, and in this realization the outputs [16] from the audio circuitry [12] are supplied to in-ear or near-ear loudspeakers (not shown in FIG. 1). In another realization, the arrangement is used as initial or “front end” processing of a human-computer interface (HCI), and the outputs [16] convey noise-reduced speech input to a machine hearing application (not shown in FIG. 1). Again, multiple realizations are possible.

FIG. 2 is a generalized description of the STTC processing [20] carried out by the exemplary audio circuitry [12]. This processing [20] includes a set of short-time Fourier transforms (STFTs) [22], each applied to a corresponding input signal [14] from a corresponding microphone [10], and each generating a corresponding frequency-domain signal [24]. The set of input signals [14] and the set of frequency-domain signals [24] are shown as x and X respectively. The STTC processing [20] further includes a set of pair-wise mask calculations [26], each operating upon a corresponding pair of the frequency-domain signals [24] and generating a corresponding ratio mask (RM) [28] (the set of all ratio masks shown as RM). A combiner [30] combines the ratio masks [28] into an overall mask [32], which is provided to a scaler [34] along with a selection or combination (Sel/Combo) [36] of the frequency-domain signals {X}. The output of the scaler [34] is a noise-reduced frequency-domain signal supplied to an inverse-STFT (I-STFT) [38] to generate the output signal(s) [16], shown as y.

Briefly, the selection/combination [36] may or may not include frequency domain signals X that are also used in the pair-wise mask calculations [26]. In an ALD implementation as described more below, it may be beneficial to apply the mask-controlled scaling [34] to signals from near-ear microphones that are separate from the microphones whose outputs are used in the pair-wise mask calculations [26]. Use of such separate near-ear microphones can help maintain important binaural cues for a user. In a computer-based implementation also described below, the mask-controlled scaling [34] may be applied to a sum of the outputs of the same microphones whose signals are used to calculate the masks.

I. System Description of 6-Microphone Short-Time Target Cancellation (STTC) Assistive Listening Device (ALD).

FIGS. 3-8 show an embodiment of an assistive listening device (ALD) using 6-microphone STTC. As will be recognized, this provides one version of an effective ALD; however, many variations are possible. FIG. 3 is a block diagram of first audio circuitry [12-1] of the 6-microphone ALD. It includes a processor [30] performing first STTC processing [20-1], as well as signal conditioning circuitry [32]. The signal conditioning circuitry [32] interfaces the processor [30] with the separate microphones and loudspeakers (not shown), and generally includes signal converters (digital to analog, analog to digital), amplifiers, analog filters, etc. as needed. In some embodiments, some or all of the conditioning circuitry [32] may be included with the processor [30] in a single integrated or hybrid circuit, and such a specialized circuit may be referred to as a digital signal processor or DSP.

FIG. 4 shows an example physical realization of an assistive listening device or ALD, specifically as a set of microphones and loudspeakers incorporated in an eyeglass frame [40] worn by a user. In this realization, the microphones [10] are realized using six forward-facing microphones [42] and two near-ear microphones [44-R], [44-L]. The forward-facing microphones [42] are enumerated 1-6 as shown, and functionally arranged into pairs 1-2, 3-4 and 5-6, with distinct intra-pair spacings of 140 mm, 80 mm and 40 mm, respectively, in one embodiment. The near-ear microphones [44] are included in respective right and left earbuds [46-R], [46-L] along with corresponding in-ear loudspeakers [48-R], [48-L].

Generally, the inputs from the six forward-facing microphones [42] are used to compute a Time-Frequency (T-F) mask (i.e., time-varying filter), which is used to attenuate non-target sound sources in the Left and Right near-ear microphones [44-L], [44-R]. The device boosts speech intelligibility for a target talker [13-T] from a designated look direction while preserving binaural cues that are important for spatial hearing.

The approach described herein avoids Interaural Level Difference (ILD) compensation by integrating the microphone pairs [42] into the frame [40] of a pair of eyeglasses and giving them a forward-facing half-omnidirectionality pattern; with this microphone placement, there is effectively no ILD and thus no ILD processing is required. One downside to this arrangement, if one were to use only these forward-facing microphones, is the potential loss of access to both head shadow ILD cues and the spectral cues provided by the pinnae (external part of the ears). However, such cues can be provided to the user by including near-ear microphones [44]. The forward-facing microphone pairs [42] are used to calculate a vector of frequency channel weights for each short-time analysis frame (i.e., a time-frequency mask); this vector of frequency channel weights is then used to filter the output of the near-ear microphones [44]. Notably, the frequency channel weights for each time slice may be applied independently to both the left and right near-ear microphones [44-L], [44-R], thereby preserving Interaural Time Difference (ITD) cues, spectral cues, and the aforementioned ILD cues. Hence, the assistive listening device described herein can enhance speech intelligibility for a target talker, while still preserving the user's natural binaural cues, which are important for spatial hearing and spatial awareness.

It is noted that the ALD as described herein may be used in connection with separate Visually Guided Hearing Aid (VGHA) technology, in which a VGHA eyetracker can be used to specify a steerable “look” direction. Steering may be accomplished using shifts, implemented in either the time domain or frequency domain, of the Left and Right signals. The STTC processing [20-1] boosts intelligibility for a target talker [13-T] in the designated “look” direction and suppresses the intelligibility of non-target talkers (or distractors) [13-NT], all while preserving binaural cues for spatial hearing.

STTC processing consists of a computationally efficient implementation of the target cancellation approach to sound source segregation, which involves removing target talker sound energy and computing gain functions for T-F tiles according to the degree to which each T-F tile is dominated by energy from the target or interfering sound sources. The STTC processing uses subtraction in the frequency domain to implement target cancellation, using only the Short-Time Fourier Transforms (STFTs) of signals from microphones.

The STTC processing computes an estimate of the Ideal Ratio Mask (IRM), which has a transfer function equivalent to that of a time-varying Wiener filter; the IRM uses the ratio of signal (i.e., target speech) energy to mixture energy within each T-F unit:

$\begin{matrix}{{IR{M\left( {t,f} \right)}} = \frac{S^{2}\left( {t,f} \right)}{{S^{2}\left( {t,f} \right)} + {N^{2}\left( {t,f} \right)}}} & (1)\end{matrix}$

where S²(t, f) and N²(t, f) are the signal (i.e., target speech) energy and noise energy, respectively. The mixture energy is the sum of the signal energy and noise energy.
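As an illustration of Eq. (1), the following minimal Python sketch evaluates the IRM for a single T-F unit from "oracle" signal and noise energies. The function name and the handling of a silent unit (zero gain when the mixture energy is zero) are assumptions made for this example and are not taken from the specification.

```python
def ideal_ratio_mask(signal_energy, noise_energy):
    """IRM for one T-F unit per Eq. (1): S^2 / (S^2 + N^2)."""
    mixture_energy = signal_energy + noise_energy
    if mixture_energy == 0.0:
        # Assumption for this sketch: a silent T-F unit receives zero gain.
        return 0.0
    return signal_energy / mixture_energy


# A target-dominated unit yields a weight near 1,
# a noise-dominated unit a weight near 0.
print(ideal_ratio_mask(9.0, 1.0))
print(ideal_ratio_mask(1.0, 9.0))
```

The weight therefore lies between 0 and 1 and can be applied directly as a per-channel gain.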

The time-domain mixture x_(i)[m] of sound at the ith microphone is composed of both signal (s_(i)) and noise (η_(i)) components:

$\begin{matrix}{{x_{i}\lbrack m\rbrack} = {{s_{i}\lbrack m\rbrack} + {\eta_{i}\lbrack m\rbrack}}} & (2)\end{matrix}$

Effecting sound source segregation amounts to an “unmixing” process that removes the noise (η) from the mixture (x) and computes an estimate (ŝ) of the signal (s). Whereas the IRM is computed using “oracle knowledge” access to both the “ground truth” signal (s_(i)) and the noise (η_(i)) components, the STTC processing has access to only the mixture (x_(i)) at each microphone. For every pair of microphones, the STTC processing computes both a Ratio Mask (RM) and a Binary Mask (BM) using only the STFTs of the sound mixtures at each microphone. The STFT X_(i)[n,k] of the sound mixture x_(i)[m] at the ith microphone is as follows:

$\begin{matrix}{{X_{i}\left\lbrack {n,k} \right\rbrack} = {{STFT\left\{ {x_{i}\lbrack m\rbrack} \right\}} = {\sum\limits_{m = {- \infty}}^{\infty}{{x_{i}\lbrack m\rbrack}{w\left\lbrack {{nH} - m} \right\rbrack}e^{{- j}\frac{2\pi k}{F}m}}}}} & (3)\end{matrix}$

where w[n] is a finite-duration Hamming window; n and k are discrete indices for time and frequency, respectively; H is a temporal sampling factor (i.e., the Hop size between FFTs) and F is a frequency sampling factor (i.e., the FFT length).
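The sum in Eq. (3) can be evaluated literally for one analysis frame, as in the minimal Python sketch below. It uses a direct DFT for clarity rather than an FFT, so it is illustrative only. The function names, the symmetric Hamming coefficients (0.54, 0.46), and the convention that the window is non-zero for 0 ≤ nH − m < window length are assumptions for this example.

```python
import cmath
import math


def hamming(length):
    """Symmetric Hamming window of the given length (length > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def stft_frame(x, n, hop, fft_len, window):
    """Return X[n, k] for k = 0..fft_len-1 by direct evaluation of Eq. (3)."""
    spectrum = []
    for k in range(fft_len):
        acc = 0j
        for j, w in enumerate(window):   # j = nH - m indexes the window
            m = n * hop - j
            if 0 <= m < len(x):          # samples outside x are treated as zero
                acc += x[m] * w * cmath.exp(-2j * math.pi * k * m / fft_len)
        spectrum.append(acc)
    return spectrum


# Usage: a cosine with exactly 3 cycles per 16 samples, analyzed in frame n = 2.
x = [math.cos(2 * math.pi * 3 * m / 16) for m in range(64)]
X = stft_frame(x, n=2, hop=16, fft_len=16, window=hamming(16))
magnitudes = [abs(v) for v in X[:9]]
print("peak bin:", magnitudes.index(max(magnitudes)))
```

A practical device would replace the inner double loop with an FFT of the windowed frame, which is the efficiency the specification relies on.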

The logic underlying the STTC processing involves computing an estimate of the noise (η), so as to subtract it from the mixture (x) and compute an estimate (ŝ) of the signal (s). This filtering (i.e., subtraction of the noise) is effected through a T-F mask, which is computed via target cancellation in the frequency domain using only the STFTs. The STTC processing consists of Short-Time Fourier Transform Magnitude (STFTM) computations, computed in parallel, that yield Mixture (M̂) and Noise (N̂) estimates that can be used to approximate the IRM, and thereby compute a time-varying filter. The Mixture (M̂), Noise (N̂) and Signal (Ŝ) estimates for each T-F tile are computed as follows using the frequency-domain signals (X_(i)) from a pair (i=[1, 2]) of microphones:

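$\begin{matrix}{{\hat{M}\left\lbrack {n,k} \right\rbrack} = {{\left| {X_{1}\left\lbrack {n,k} \right\rbrack} \right|} + {\left| {X_{2}\left\lbrack {n,k} \right\rbrack} \right|}}} & (4)\end{matrix}$

$\begin{matrix}{{\hat{N}\left\lbrack {n,k} \right\rbrack} = {\left| {{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}} \right|}} & (5)\end{matrix}$

$\begin{matrix}{{\hat{S}\left\lbrack {n,k} \right\rbrack} = {{\hat{M}\left\lbrack {n,k} \right\rbrack} - {\hat{N}\left\lbrack {n,k} \right\rbrack}}} & (6)\end{matrix}$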

The processing described here assumes a target talker “straight ahead” at 0°. With the target-talker waveforms at the two microphones in phase (i.e., time-aligned) with each other, the cancellation process can be effected via subtraction in either the time domain (e.g., x₁[m]−x₂[m]) or the frequency domain, as in the Noise (N̂) estimate shown above.

The Noise estimate (N̂) is computed by subtracting the STFTs before taking their magnitude, thereby allowing phase interactions that cancel the target spectra. The Mixture (M̂) estimate takes the respective STFT magnitudes before addition, thereby preventing phase interactions that would otherwise cancel the target spectra. A Signal (Ŝ) estimate can be computed by subtracting the Noise (N̂) estimate from the Mixture (M̂) estimate. The processing described in this section assumes a target talker “straight ahead” at 0°. However, the “look” direction can be “steered” via sample shifts implemented in the time domain prior to T-F analysis. Alternatively, these “look” direction shifts could be implemented in the frequency domain.
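The effect of taking magnitudes before addition (Mixture) versus after subtraction (Noise) can be seen in a minimal Python sketch of Eqs. (4) through (6) for a single T-F tile. The function name and the two idealized test tiles (a perfectly in-phase component and a perfectly anti-phase component) are assumptions chosen for illustration.

```python
def sttc_estimates(X1, X2):
    """Mixture, Noise and Signal estimates for one T-F tile, Eqs. (4)-(6)."""
    mixture = abs(X1) + abs(X2)   # magnitudes first, so no cancellation
    noise = abs(X1 - X2)          # subtraction first, so in-phase energy cancels
    signal = mixture - noise
    return mixture, noise, signal


# In-phase tile (time-aligned target at 0 degrees): cancels in the Noise term.
print(sttc_estimates(1 + 0j, 1 + 0j))
# Anti-phase tile (idealized off-axis interferer): survives in the Noise term.
print(sttc_estimates(1 + 0j, -1 + 0j))
```

In the first case the whole tile is attributed to the target; in the second, the whole tile is attributed to noise.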

Assuming a perfect cancellation of only the target (i.e., Signal) spectra, the N̂ term contains the spectra of all non-target sound sources (i.e., Noise) in each T-F tile. The STTC processing uses the Mixture (M̂) and Noise (N̂) STFTM computations to estimate the ratio of Signal (Ŝ) (i.e., target) energy to mixture energy in every T-F tile:

$\begin{matrix}{{R{M\left\lbrack {n,k} \right\rbrack}} = {\frac{{\hat{M}\left\lbrack {n,k} \right\rbrack} - {\hat{N}\left\lbrack {n,k} \right\rbrack}}{\hat{M}\left\lbrack {n,k} \right\rbrack} = \frac{\hat{S}\left\lbrack {n,k} \right\rbrack}{{\hat{S}\left\lbrack {n,k} \right\rbrack} + {\hat{N}\left\lbrack {n,k} \right\rbrack}}}} & (7)\end{matrix}$

The Mixture (M̂) and Noise (N̂) terms are short-time spectral magnitudes used to estimate the IRM for multiple frequency channels [k] in each analysis frame [n]. The resulting Ratio Mask RM[n, k] is a vector of frequency channel weights for each analysis frame. RM[n, k] can be computed directly using the STFTs of the signals from the microphone pair:

$\begin{matrix}{{R{M\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}} - {{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}}} & (8)\end{matrix}$

A Binary Mask BM[n, k] may also be computed using a thresholding function, with threshold value ψ, which may be set to a fixed value of ψ=0.2 for example:

$\begin{matrix}{{B{M\left\lbrack {n,k} \right\rbrack}} = \left\{ \begin{matrix}{1\ } & {{{if}\mspace{14mu} {{RM}\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\0 & {\ {{{if}\mspace{20mu} {{RM}\left\lbrack {n,k} \right\rbrack}} < \psi}}\end{matrix} \right.} & (9)\end{matrix}$

FIG. 5 illustrates one aspect of the disclosed technique, namely addressing the problem of “null phase differences” that impair performance within certain frequencies for any one pair of microphones. The top panel illustrates the phase separations of the three pairs of microphones across the frequency range of 0 to 8 kHz, and for three different interfering sound source directions (30°, 60° and 90°). For each microphone pair with respective intra-pair microphone spacing, there are frequencies at which there is little to no phase difference, such that target cancellation based on phase differences cannot be effectively implemented. The disclosed technique employs multiple microphone pairs, with varied spacings, to address this issue.

In the illustrated example, three microphone pairs having respective distinct spacings (e.g., 140, 80 and 40 mm) are used, and their outputs are combined via “piecewise construction”, as illustrated in the bottom panel of FIG. 5; i.e., combined in a manner that provides positive absolute phase differences for the STTC processing to work with in the 0-8 kHz band that is most important for speech intelligibility. In particular, this plot illustrates the “piecewise construction” approach to creating a chimeric Global Ratio Mask RM_(G) from the individual Ratio Masks for the three microphone pairs ([1, 2], [3, 4], [5, 6]). This is described in additional detail below.
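To make the “null phase difference” issue concrete, the following Python sketch estimates where such nulls fall for each spacing. It assumes a simple far-field plane-wave model, in which the inter-microphone delay is d·sin(θ)/c with c = 343 m/s; this model, the constant, and the function names are assumptions for illustration and are not stated in the specification.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for this sketch


def pair_delay(spacing_m, angle_deg):
    """Assumed far-field delay between the two microphones of a pair."""
    return spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND


def wrapped_phase_difference(freq_hz, spacing_m, angle_deg):
    """Absolute inter-microphone phase difference, wrapped to [0, pi]."""
    phase = 2 * math.pi * freq_hz * pair_delay(spacing_m, angle_deg)
    return abs(math.atan2(math.sin(phase), math.cos(phase)))


def first_null_frequency(spacing_m, angle_deg):
    """Lowest non-zero frequency at which the wrapped phase difference is zero."""
    delay = pair_delay(spacing_m, angle_deg)
    return math.inf if delay == 0.0 else 1.0 / delay


for spacing in (0.140, 0.080, 0.040):
    print(spacing, "m:", round(first_null_frequency(spacing, 90.0)), "Hz")
```

Under these assumptions the nulls of the different spacings fall at different frequencies, so at a frequency where one pair has no usable phase difference another pair still does, which is the motivation for combining pairs with varied spacings.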

FIG. 6 is a block diagram of the STTC processing [20-1] (FIG. 3). Overall, it includes the following distinct stages of calculations:

1. Short-Time Fourier Transform (STFT) processing [50], converts each microphone signal into frequency domain signal
2. Ratio Mask (RM) and Binary Mask (BM) processing [52], applied to frequency domain signals of microphone pairs
3. Global Ratio Mask (RM_(G)) and Thresholded Ratio Mask (RM_(T)) processing [54], uses ratio masks of all microphone pairs
4. Output signal processing [56], uses the Thresholded Ratio Mask (RM_(T)) to scale/modify selected microphone signals to serve as output signal(s) [16]

The above stages of processing are described in further detail below.

1. STFT Processing [50]

Short-Time Fourier Transforms (STFTs) are continually calculated from frames of each input signal x[m] according to the following calculation:

$\begin{matrix}{{X_{i}\left\lbrack {n,k} \right\rbrack} = {{STFT\left\{ {x_{i}\lbrack m\rbrack} \right\}} = {\sum\limits_{m = {- \infty}}^{\infty}{{x_{i}\lbrack m\rbrack}{w\left\lbrack {{nH} - m} \right\rbrack}e^{{- j}\frac{2\pi k}{F}m}}}}} & (10)\end{matrix}$

where i is the index of the microphone, w[n] is a finite-duration Hamming window; n and k are discrete indices for time and frequency, respectively; H is a temporal sampling factor (i.e., the Hop size between FFTs) and F is a frequency sampling factor (i.e., the FFT length).
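As a minimal sketch of the per-frame calculation in equation (10), the following Python/NumPy fragment windows one analysis frame and takes its FFT. The function name `stft_frame`, the choice of a window length equal to the FFT length, and the zero-padding of a short final frame are assumptions made for illustration and are not taken from the specification. The phase reference here is the start of the frame rather than the absolute sample index m, which only changes a phase factor that is common to all microphones within a frame.

```python
import numpy as np

def stft_frame(x, n, hop, fft_len):
    """Windowed FFT of analysis frame n of a 1-D signal x (one microphone)."""
    window = np.hamming(fft_len)          # finite-duration Hamming window w
    start = n * hop                       # H: hop size between FFTs
    frame = np.zeros(fft_len)
    chunk = x[start:start + fft_len]
    frame[:len(chunk)] = chunk            # zero-pad a short final frame (assumption)
    return np.fft.fft(frame * window)     # F: FFT length, bins k = 0..F-1

# Usage: a 1 kHz tone sampled at 16 kHz falls in bin 1000 / (16000 / 256) = 16.
fs, fft_len, hop = 16000, 256, 128
t = np.arange(1024) / fs
X = stft_frame(np.cos(2 * np.pi * 1000 * t), n=1, hop=hop, fft_len=fft_len)
peak_bin = int(np.argmax(np.abs(X[:fft_len // 2])))
```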

2. STTC Processing [52]

Pairwise ratio masks RM, one for each microphone spacing (140, 80 and 40 mm) are calculated as follows; i.e., there is a unique RM for each pair of microphones ([1,2], [3,4], [5,6]):

$\begin{matrix}{{R{M_{1,2}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}} - {{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {11a} \right) \\{{{RM}_{3,4}\left\lbrack {n,k} \right\rbrack} = \frac{{{X_{3}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}} - {{{X_{3}\left\lbrack {n,k} \right\rbrack} - {X_{4}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{3}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}}}} & \left( {11b} \right) \\{{R{M_{5,6}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{5}\left\lbrack {n,k} \right\rbrack}} + {{X_{6}\left\lbrack {n,k} \right\rbrack}} - {{{X_{5}\left\lbrack {n,k} \right\rbrack} - {X_{6}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{5}\left\lbrack {n,k} \right\rbrack}} + {{X_{6}\left\lbrack {n,k} \right\rbrack}}}} & \left( {11c} \right)\end{matrix}$

Pairwise Binary Masks BM are calculated as follows, using a thresholding function ψ, which in one example is a constant set to a relatively low value (0.2 on a scale of 0 to 1):

$\begin{matrix}{{B{M_{1,2}\left\lbrack {n,k} \right\rbrack}} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} {{RM}_{1,2}\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\0 & {{{if}\mspace{14mu} {{RM}_{1,2}\left\lbrack {n,k} \right\rbrack}} < \psi}\end{matrix} \right.} & \left( {12a} \right) \\{{{BM}_{3,4}\left\lbrack {n,k} \right\rbrack} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} {{RM}_{3,4}\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\0 & {{{if}\mspace{14mu} {{RM}_{3,4}\left\lbrack {n,k} \right\rbrack}} < \psi}\end{matrix} \right.} & \left( {12b} \right) \\{{{BM}_{5,6}\left\lbrack {n,k} \right\rbrack} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} {{RM}_{5,6}\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\0 & {{{if}\mspace{14mu} {{RM}_{5,6}\left\lbrack {n,k} \right\rbrack}} < \psi}\end{matrix} \right.} & \left( {12c} \right)\end{matrix}$

In the low frequency channels, a ramped binary mask threshold may be used for the most widely spaced microphone pair (BM_(1,2)) to address the issue of poor cancellation at these low frequencies. Thus at the lowest frequencies, where cancellation is least effective, a higher threshold is used. An example of such a ramped threshold is described below.
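The pairwise mask calculations can be sketched as follows, reading the Mixture and Noise terms as short-time spectral magnitudes, as the summary later in this description states. The function names and the small `eps` guard against division by zero are illustrative assumptions; `psi` may be a scalar or a per-frequency vector, which is one way a ramped threshold could be supplied.

```python
import numpy as np

def ratio_mask(X_a, X_b, eps=1e-12):
    """Pairwise ratio mask for one frame from two complex spectra."""
    mixture = np.abs(X_a) + np.abs(X_b)   # Mixture estimate
    noise = np.abs(X_a - X_b)             # Noise estimate (target cancelled)
    return (mixture - noise) / (mixture + eps)  # eps is an assumed guard

def binary_mask(rm, psi=0.2):
    """1 where the ratio mask is at or above the threshold psi, else 0."""
    return (rm >= psi).astype(float)

# Usage: identical spectra give a weight near 1, opposite spectra give 0.
X_a = np.array([1 + 1j, 2.0, 1j])
X_b = np.array([1 + 1j, -2.0, 1.0])
rm = ratio_mask(X_a, X_b)
bm = binary_mask(rm)
```

A signal arriving from the look direction reaches both microphones in phase, so the subtraction removes it from the Noise estimate and the mask weight stays near 1.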

3. Global Ratio Mask (RM_(G)) and Thresholded Ratio Mask (RM_(T)) Processing [54]

As mentioned above, a piecewise approach to creating a chimeric Global Ratio Mask RM_(G) from the individual Ratio Masks for the three microphone pairs ([1,2], [3,4], [5,6]) is used. In one example, the RM_(G) is constructed, in a piece-wise manner, thusly (see bottom panel of FIG. 5):

RM_(G)[n, 1:32] = RM_(1,2)[n, 1:32] (≈ 0 → 1500 Hz)
RM_(G)[n, 33:61] = RM_(3,4)[n, 33:61] (≈ 1500 → 3000 Hz)
RM_(G)[n, 62:F/2] = RM_(5,6)[n, 62:F/2] (≈ 3000 → F_(s)/2 Hz)

The illustration of piecewise selection of discrete frequency channels (k) shown above is for a sampling frequency (Fs) of 50 kHz and an FFT size (F) of 1024 samples; the discrete frequency channels used will vary according to the specified values of Fs and F. The piecewise-constructed Global Ratio Mask RM_(G) is also given conjugate symmetry (i.e. negative frequencies are the mirror image of positive frequencies) to ensure that the STTC processing yields a real (rather than complex) output. Additional detail is given below.
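A sketch of the piecewise construction for one analysis frame, assuming full-length (F-point) mask vectors and translating the 1-based channel ranges above into 0-based Python slices. The function name, the `edges` parameter, and the choice to take the Nyquist bin from RM_(5,6) are assumptions for illustration; the channel ranges in the text do not say how that bin is handled.

```python
import numpy as np

def global_ratio_mask(rm12, rm34, rm56, edges=(32, 61)):
    """Piecewise global mask for one frame; inputs are length-F vectors."""
    F = len(rm12)
    half = F // 2
    e1, e2 = edges
    rm_g = np.empty(F)
    rm_g[:e1] = rm12[:e1]                   # channels 1:32   (about 0-1500 Hz)
    rm_g[e1:e2] = rm34[e1:e2]               # channels 33:61  (about 1500-3000 Hz)
    rm_g[e2:half + 1] = rm56[e2:half + 1]   # channels 62:F/2, plus Nyquist (assumed)
    rm_g[half + 1:] = rm_g[1:half][::-1]    # mirror onto negative frequencies
    return rm_g

# Usage with F = 1024: constant masks make the three segments easy to see.
F = 1024
rm_g = global_ratio_mask(np.full(F, 0.1), np.full(F, 0.5), np.full(F, 0.9))
```

Because the mask is real and mirrored, multiplying it with the spectrum of a real signal keeps the spectrum conjugate-symmetric, so the inverse FFT is real.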

A singular Global Binary Mask BM_(G) is computed from the three Binary Masks (BM_(1,2), BM_(3,4), BM_(5,6)), where × specifies element-wise multiplication:

BM_(G)[n,k]=BM_(1,2)[n,k]×BM_(3,4)[n,k]×BM_(5,6)[n,k]  (13)

Multiplication of the Global Ratio Mask RM_(G) with the Global Binary Mask BM_(G) yields a Thresholded Ratio Mask RM_(T)[n, k] that is used for reconstruction of the target signal in the output signal processing [56], as described below. Note that RM_(T)[n, k] has weights of 0 below the threshold ψ and continuous “soft” weights at and above ψ.
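Equation (13) and the thresholding step can be sketched in a few lines. The function name and the example values are hypothetical and serve only to show the element-wise products.

```python
import numpy as np

def thresholded_ratio_mask(rm_g, bm12, bm34, bm56):
    """Return (BM_G, RM_T) for one frame via element-wise multiplication."""
    bm_g = bm12 * bm34 * bm56     # 1 only where all three pairs pass
    rm_t = rm_g * bm_g            # soft weights where kept, 0 elsewhere
    return bm_g, rm_t

# Usage with four illustrative frequency channels.
rm_g = np.array([0.9, 0.15, 0.6, 0.3])
bm_g, rm_t = thresholded_ratio_mask(
    rm_g,
    np.array([1.0, 0.0, 1.0, 1.0]),
    np.array([1.0, 1.0, 1.0, 0.0]),
    np.array([1.0, 1.0, 1.0, 1.0]),
)
```

A channel is zeroed if any one pair's binary mask rejects it, so the global binary mask acts as a logical AND across the pairs.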

The Global Ratio Mask (RM_(G)), the Global Binary Mask (BM_(G)) and the Thresholded Ratio Mask (RM_(T)) are all effective time-varying filters, with a vector of frequency channel weights for every analysis frame. Any one of the three (i.e., RM_(G), BM_(G) or RM_(T)) can provide an intelligibility benefit for a target talker, and suppress both stationary and non-stationary interfering sound sources. RM_(T) is seen as the most desirable, effective and useful of the three; hence it is used for producing the output in the block diagram shown in FIG. 6.

4. Output Signal Processing [56]

The output signal(s) may be either stereo or monaural (“mono”), and these are created in correspondingly different ways as explained below.

Reconstruction of Target Signal with STEREO Output

Stereo output may be used, for example in applications such as ALD where it is important to preserve binaural cues such as ILD, ITD. The output of the STTC processing is an estimate of the target speech signal from the specified look direction. The Left and Right (i.e. stereo pair) Time-Frequency domain estimate (Y_(L)[n, k] and Y_(R)[n, k]) of the target speech signal (y_(L)[m] and y_(R)[m]) can be described thusly, where X_(L) and X_(R) are the Short Time Fourier Transforms (STFTs) of the signals x_(L) and x_(R), from the designated Left and Right in-ear or near-ear microphones [44] (FIG. 4), and the Thresholded Ratio Mask RM_(T)[n, k] is the conjugate-symmetric mask (i.e. the set of short-time weights for all frequencies, both positive and negative) computed in the mask processing [54] as described above:

Y_(L)[n,k]=RM_(T)[n,k]×X_(L)[n,k]
Y_(R)[n,k]=RM_(T)[n,k]×X_(R)[n,k]  (14)

Alternatively, the Global Ratio Mask (RM_(G)) could be used to produce the stereo output:

Y_(L)[n,k]=RM_(G)[n,k]×X_(L)[n,k]
Y_(R)[n,k]=RM_(G)[n,k]×X_(R)[n,k]  (15)

Synthesis of a stereo output (y_(L)[m] and y_(R)[m]) estimate of the target speech signal consists of taking the Inverse Short Time Fourier Transforms (ISTFTs) of Y_(L)[n, k] and Y_(R)[n, k] and using the overlap-add method of reconstruction.

While the Global Binary Mask BM_(G) could also be used to produce the stereo output, the continuously valued frequency channel weights of the RM_(G) and RM_(T) are more desirable, yielding better speech intelligibility and speech quality than the BM_(G). RM_(T) is seen as the most desirable, effective and useful of the three; hence it is used for producing the output in the block diagram shown in FIG. 6. However, an effective system for enhancing speech intelligibility could be built using only RM_(G), hence the claim section builds upon a system that uses RM_(G) to filter the output of the assistive listening device.

Reconstruction of Target Signal with MONO Output

A mono output (denoted below with the subscript M) may be used in other applications in which the preservation of binaural cues is absent or less important. In one example, a mono output can be computed via an average of the STFTs across multiple microphones, where I is the total number of microphones:

$\begin{matrix}{{X_{M}\left\lbrack {n,k} \right\rbrack} = \frac{\sum\limits_{i = 1}^{I}{X_{i}\left\lbrack {n,k} \right\rbrack}}{I}} & (16) \\{{Y_{M}\left\lbrack {n,k} \right\rbrack} = {R{M_{T}\left\lbrack {n,k} \right\rbrack} \times {X_{M}\left\lbrack {n,k} \right\rbrack}}} & (17)\end{matrix}$

Alternatively, the Global Ratio Mask (RM_(G)) could be used to produce the mono output:

Y _(M)[n,k]=RM_(G)[n,k]×X _(M)[n,k]  (18)

The Mono output y_(M)[m] is produced by taking Inverse Short Time Fourier Transforms (ISTFT) of Y_(M)[n, k] and using the overlap-add method of reconstruction.

Steering the Nonlinear Beamformer's “Look” Direction

The default target sound source “look” direction is “straight ahead” at 0°. However, if deemed necessary or useful, an eyetracker could be used to specify the “look” direction, which could be “steered” via τ time shifts, implemented in either the time or frequency domains, of the Left and Right signals. The STTC processing could boost intelligibility for the target talker from the designated “look” direction and suppress the intelligibility of the distractors, all while preserving binaural cues for spatial hearing.

The τ sample shifts are computed independently for each pair of microphones, where f_(s) is the sampling rate, d is the inter-microphone spacing in meters, λ is the speed of sound in meters per second and θ is the specified angular “look” direction in radians:

$\begin{matrix}{\tau_{\lbrack{1,2}\rbrack} = \left\lfloor {f_{s} \times \frac{d_{\lbrack{1,2}\rbrack}}{\lambda}{\sin (\theta)}} \right\rceil} & \left( {19a} \right) \\{\tau_{\lbrack{3,4}\rbrack} = \left\lfloor {f_{s} \times \frac{d_{\lbrack{3,4}\rbrack}}{\lambda}{\sin (\theta)}} \right\rceil} & \left( {19b} \right) \\{\tau_{\lbrack{5,6}\rbrack} = \left\lfloor {f_{s} \times \frac{d_{\lbrack{5,6}\rbrack}}{\lambda}{\sin (\theta)}} \right\rceil} & \left( {19c} \right)\end{matrix}$

These τ time shifts are used both for the computation of the Ratio Masks (RMs) as well as for steering the beamformer used for the Mono version of the STTC processing.
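The rounding in equations (19a) to (19c) can be illustrated with a short calculation. The function name is hypothetical, and the speed of sound of 343 m/s is an assumed value; the specification does not state one. The sampling rate of 50 kHz and the spacings of 140, 80 and 40 mm are the example values used elsewhere in this description.

```python
import math

def steering_shift_samples(fs, d, theta, speed_of_sound=343.0):
    """Integer sample shift for one microphone pair (343 m/s is assumed)."""
    return round(fs * d / speed_of_sound * math.sin(theta))

# Usage: look direction of 30 degrees for the three example spacings.
theta = math.radians(30)
shifts = [steering_shift_samples(50000, d, theta) for d in (0.140, 0.080, 0.040)]
```

With these assumed values the shifts come out to 10, 6 and 3 samples, and a look direction of 0° gives a shift of zero for every pair.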

FIG. 7 shows an example ramped threshold used to compute the Binary Mask BM_(1,2) for the most widely spaced pair of microphones, as mentioned above. For frequencies below 2500 Hz, the threshold ramps linearly. This ramped threshold for the 6-microphone array is somewhat more aggressive than might be used in other embodiments, for example with an 8-microphone array as described below. The use of a ramped threshold improves cancellation performance for distractors located at off-axis angles of approximately 30°.

FIG. 8 illustrates an alternative physical realization in which the near-ear microphones [44] are located on the temple pieces of the frame [40] rather than in the earbuds [60]. FIG. 8 shows only the right near-ear microphone [44-R]; a similar placement on the left temple piece is used for the left near-ear microphone [44-L].

An STTC ALD as described herein can improve speech intelligibility for a target talker while preserving Interaural Time Difference (ITD) and Interaural Level Difference (ILD) binaural cues that are important for spatial hearing. These binaural cues are not only important for effecting sound source localization and segregation, they are important for a sense of Spatial Awareness. While the processing described herein aims to eliminate the interfering sound sources altogether, the user of the STTC ALD device could choose whether to listen to the unprocessed waveforms at the Left and Right near-ear microphones, the processed waveforms, or some combination of both. The binaural cues that remain after filtering with the Time-Frequency (T-F) mask are consistent with the user's natural binaural cues, which allows for continued Spatial Awareness with a mixture of the processed and unprocessed waveforms. The ALD user might still want to hear what is going on in the surroundings, but will be able to turn the surrounding interfering sound sources down to a comfortable and ignorable, rather than distracting, intrusive and overwhelming, sound level. For example, in some situations, it would be helpful to be able to make out the speech of surrounding talkers, even though the ALD user is primarily focused on listening to the person directly in front of them.

Brief Summary of the STTC Assistive Listening Device Embodiment of the Invention.

An Assistive Listening Device (ALD) embodiment of the claimed invention computes a ratio mask in real-time using signals from microphones and Fast Fourier Transforms (FFTs) thereof, and without any knowledge about the noise source(s). As set forth in ¶0025-0045, the invention's Ratio Mask RM[n, k] can be computed using the Short-Time Fourier Transforms (STFTs) of signals from a microphone pair (e.g., i=[1, 2]):

$\begin{matrix}{{{RM}\left\lbrack {n,k} \right\rbrack} = \frac{\overset{\overset{{Mixture}\mspace{14mu} {estimate}\mspace{14mu} \hat{M}}{}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}} - \overset{\overset{{Noise}\mspace{14mu} {estimate}\mspace{14mu} \hat{N}}{}}{{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}}}}{\underset{\underset{{Mixture}\mspace{14mu} {estimate}\mspace{14mu} \hat{M}}{}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}}}} & (20)\end{matrix}$

The Mixture (M̂) and Noise (N̂) terms are short-time spectral magnitudes used to estimate the Ideal Ratio Mask (IRM) for multiple frequency channels [k] in each analysis frame [n]. The resulting Ratio Mask RM[n, k] is a vector of frequency channel weights for each analysis frame. An embodiment of the invention, an eyeglass-integrated assistive listening device, is shown in FIGS. 4-6. Multiple pairwise ratio masks can be computed for multiple microphone pairs (e.g., [1,2], [3,4], [5,6]) with varied spacings. A chimeric Global Ratio Mask (RM_(G)) can be constructed in a piecewise manner (see FIGS. 5 and 6), selecting a range of frequency channels from the individual pairwise ratio masks, so as to provide a positive absolute phase difference for the processing to work with.

Absolute phase differences for three microphone spacings (140, 80 and 40 mm) and three Direction of Arrival (DOA) angles (±30°, ±60°, ±90°) are plotted in the top row of FIG. 5. There is an interaction between frequency, microphone spacing and Direction of Arrival angle (θ) that yields wrapped [−π, π] absolute phase differences of zero at specific frequencies. Where the phase difference is at or near zero, the target cancellation approach is ineffective, as the interfering sound sources are cancelled at these frequencies and thereby are erroneously included in the frequency-domain signal estimate (Ŝ=M̂−N̂). Multiple microphone pairs are used to overcome this null phase difference problem and thereby improve performance. This is further illustrated in FIG. 9 for a mixture of three concurrent talkers (compare FIGS. 5, 6 and 9).
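To make the null phase difference problem concrete, the sketch below lists the frequencies at which the wrapped phase difference returns to zero for a far-field source, i.e. where the frequency times the inter-microphone delay is a whole number. The function names and the speed of sound of 343 m/s are assumptions for illustration, so the exact frequencies are indicative only.

```python
import numpy as np

def wrapped_abs_phase_difference(f_hz, d, theta, speed_of_sound=343.0):
    """Absolute phase difference wrapped to [-pi, pi] for spacing d (m)."""
    tau = d / speed_of_sound * np.sin(theta)      # far-field delay in seconds
    return np.abs(np.angle(np.exp(1j * 2 * np.pi * f_hz * tau)))

def null_frequencies(d, theta, f_max=8000.0, speed_of_sound=343.0):
    """Frequencies in (0, f_max] where the wrapped phase difference is zero."""
    tau = d / speed_of_sound * np.sin(theta)
    count = int(np.floor(f_max * tau))            # whole delay cycles up to f_max
    return [m / tau for m in range(1, count + 1)]

# Usage: an interferer at 90 degrees, for the three example spacings.
nulls_140 = null_frequencies(0.140, np.pi / 2)
nulls_080 = null_frequencies(0.080, np.pi / 2)
nulls_040 = null_frequencies(0.040, np.pi / 2)
```

Under these assumptions the 140 mm pair has nulls near 2450, 4900 and 7350 Hz, the 80 mm pair has one near 4290 Hz, and the 40 mm pair has none below 8 kHz, which is why a different pair is selected in each frequency range.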

Example Time-Frequency (T-F) masks for a mixture of three talkers are shown in FIG. 9. The three concurrent talkers were at −60°, 0° and +60°, with all three talkers at equal loudness. The target talker was “straight ahead” at 0° and the two interfering talkers were to the left and right at ±60°. The Ratio Masks from the three microphone pairs ([1,2], [3,4] and [5,6]) are shown in the first three panels. For each of these three Ratio Masks (RM_(1,2), RM_(3,4) and RM_(5,6)), there are frequencies at which there is no phase difference between target and interferer, resulting in bands of T-F tiles with (incorrect) values of (or near) “1” (see horizontal white bands in the first three panels). However, multiple T-F masks from the three microphone pairs can be interfaced to yield a T-F mask (fourth panel) that is similar in appearance to ideal masks (bottom panels) that are computed using “oracle knowledge” of the signal and noise components in the mixture. In this example, the Thresholded Ratio Mask (RM_(T)) is a post-processed variant of the Global Ratio Mask (RM_(G)). Both the Global Ratio Mask (RM_(G)) and the Thresholded Ratio Mask (RM_(T)) (see FIG. 9) are effective time-varying filters, with a vector of frequency channel weights for every analysis frame.

The processing computes multiple pairwise ratio masks for multiple microphone spacings (e.g., 140, 80 and 40 mm). Each of the three Ratio Masks (RM_(1,2), RM_(3,4) and RM_(5,6)) has frequency bands where the T-F tiles are being overestimated (see horizontal white bands with values of “1” in FIG. 9). However, the multiple pairwise ratio masks can be interfaced (FIGS. 5 and 6) to compute a chimeric (i.e., composite) T-F mask which can look similar to the Ideal Ratio Mask (IRM) (see FIG. 9). Only the signals from the microphones (see FIG. 6) were used as input, whereas the IRM, which has a transfer function equivalent to a time-varying Wiener filter, is granted access to the component Signal (S) and Noise (N) terms:

$\begin{matrix}{{{IRM}\left( {t,f} \right)} = \frac{S^{2}\left( {t,f} \right)}{{S^{2}\left( {t,f} \right)} + {N^{2}\left( {t,f} \right)}}} & (21)\end{matrix}$

where S²(t, f) and N²(t, f) are the signal (i.e., target speech) energy and noise energy, respectively; i.e., the Ideal Ratio Mask has “oracle knowledge” of the signal and noise components. The STTC ALD is capable of computing a T-F mask, in real-time, that is similar to the IRM (see FIG. 9), and does so without requiring any information about the noise source(s).

The hard problem here is not the static noise sources (think of the constant hum of a refrigerator); the real challenge is competing talkers, as speech has spectrotemporal variations that established approaches have difficulty suppressing. Stationary noise has a spectrum that does not change over time, whereas interfering speech, with its spectrotemporal fluctuations, is an example of non-stationary noise. Because the assistive listening device computes a time-varying filter in real-time, it can attenuate both stationary and non-stationary sound sources.

The invention employs causal and memoryless “frame-by-frame” processing; i.e., the T-F masks are computed using only the information from the current short-time analysis frame. Because of this, it is suitable for use in assistive listening device applications, which require causal and computationally efficient (i.e., FFT-based) low-latency (≤20 ms) processing. The assistive listening device's time-varying filtering, which can attenuate both stationary and non-stationary noise, can be applied on a frame-by-frame basis to signals at the Left and Right ears, thereby effecting real-time (and low-latency) sound source segregation that can enhance speech intelligibility for a target talker, while still preserving binaural cues for spatial hearing.

The audio circuitry of the invention operates on a frame-by-frame basis, with processing that is both causal and memoryless; i.e., it does not use information from the future or the past. There are existing methods that can segregate competing talkers by computing a Time-Frequency (T-F) mask, which is effectively a time varying filter with a vector of frequency channel weights for every analysis frame. However, many of these methods, including Deep-Neural-Network (DNN) based approaches, use noncausal block processing to compute T-F tiles for each analysis frame. In order for an assistive listening device to operate on a “frame by frame” basis, it cannot use data from the future. This is illustrated in FIG. 10 for an example grid of T-F tiles with sixteen discrete frequency channels (k) and eleven short-time analysis frames (n); if the processing uses information from future T-F tiles (dark gray), it is noncausal; likewise, if it uses information from the past (light gray), it is non-memoryless. Causal and memoryless processing would consist of computing frequency channel (k) weights using data from only the current analysis frame (n).

These concerns regarding causality also relate to processing latencies for assistive listening devices. A device might violate the causality requirement by looking only a handful of frames into the future. However, one has to be mindful of the latency constraints; in order for an assistive listening device to be useful, the overall processing delay must be ≤20 ms (i.e., 1/50th of a second) for closed-fit hearing aids and ≤10 ms (i.e., 1/100th of a second) for open-fit hearing aids. If an assistive listening device were to look even just a few frames into the future, it would fail to meet these strict latency requirements.

Because the invention operates on a frame-by-frame basis, and the ratio mask computation requires only FFTs from microphone signals, the processing latency is determined by the length of the analysis window. An estimate of the processing latency is 2.5× the duration of the analysis window; this takes into account the fact that the Inverse Short-Time Fourier Transform (ISTFT) reconstruction requires two frames for Overlap-Add (OLA). Hence, a 20 ms latency for the invention can be achieved by using an 8 ms analysis window; likewise, a 10 ms latency can be achieved by using a 4 ms analysis window. The invention is capable of running in real-time with low latency. Equation 22 below is a variation of Equation 8 (and Equation 20) that further illustrates that the frame-by-frame computation is effected with vectors of frequency channel weights (k). Those skilled in the art of audio signal processing will understand that the STFTs in equation 8 (and equation 20) can be computed on a frame-by-frame basis using vectors (indicated by “:”) of frequency channel (k) values for every analysis frame (n):

$\begin{matrix}{{{RM}\left\lbrack {n,:} \right\rbrack} = \frac{\overset{\overset{{Mixture}\mspace{14mu} {estimate}\mspace{14mu} \hat{M}}{}}{{{X_{1}\left\lbrack {n,:} \right\rbrack}} + {{X_{2}\left\lbrack {n,:} \right\rbrack}}} - \overset{\overset{{Noise}\mspace{14mu} {estimate}\mspace{14mu} \hat{N}}{}}{{{X_{1}\left\lbrack {n,:} \right\rbrack} - {X_{2}\left\lbrack {n,:} \right\rbrack}}}}{\underset{\underset{{Mixture}\mspace{14mu} {estimate}\mspace{14mu} \hat{M}}{}}{{{X_{1}\left\lbrack {n,:} \right\rbrack}} + {{X_{2}\left\lbrack {n,:} \right\rbrack}}}}} & (22)\end{matrix}$

The invention computes a time-varying filter, in the form of a vector (:) of frequency channel (k) weights for every analysis frame (n), using only the data from the current analysis frame. As such, it is causal, memoryless, is capable of running in real time, and can be used to attenuate both stationary and non-stationary sound sources. The invention computes a real-time ratio mask, and does so with efficient low-latency frame-by-frame processing.
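The latency estimate given above (2.5× the analysis window duration) can be checked with simple arithmetic. The helper names are hypothetical, and the 50 kHz sampling rate is the example value used earlier for the piecewise mask construction, applied here only to show the window length in samples.

```python
def estimated_latency_ms(window_ms, factor=2.5):
    """Latency estimate: 2.5x the analysis window duration."""
    return factor * window_ms

def window_samples(window_ms, fs_hz):
    """Analysis window length in samples at sampling rate fs_hz."""
    return round(window_ms * fs_hz / 1000)

# Usage: windows that meet the closed-fit and open-fit latency limits.
closed_fit = estimated_latency_ms(8)    # 20.0 ms
open_fit = estimated_latency_ms(4)      # 10.0 ms
samples_8ms = window_samples(8, 50000)  # 400 samples at 50 kHz
```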

Using a Phase Difference Normalization Vector (PDNV) to Scale the Noise Estimate.

A variation on the processing described in ¶0025-0045 of this and the original specification, and summarized herein in ¶0056-0064, involves scaling the Noise estimate (N̂) used to compute a pairwise Ratio Mask (RM) by what is hereby referred to as a discrete-frequency (k) dependent Phase Difference Normalization Vector (PDNV), denoted as Γ[k] in Equation 23 below:

$\begin{matrix}{{{RM}\left\lbrack {n,k} \right\rbrack} = \frac{\overset{\overset{{Mixture}\mspace{14mu} {estimate}\mspace{14mu} \hat{M}}{}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}} - {\overset{\overset{PDNV}{}}{\Gamma \lbrack k\rbrack}\overset{\overset{{Noise}\mspace{14mu} {estimate}\mspace{14mu} \hat{N}}{}}{{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}}}}}{\underset{\underset{{Mixture}\mspace{14mu} {estimate}\mspace{14mu} \hat{M}}{}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}}}} & (23)\end{matrix}$

Note that Γ[k] is discrete-frequency (k) dependent but is not time-dependent, nor is it computed using signal values. For a known microphone spacing, Γ[k] can be pre-computed so as to scale and normalize the discrete-frequency (k) dependent elements of the Noise estimate (N̂) for each analysis frame n. The scaling of the Noise estimate (N̂) by Γ[k] is effected through element-wise multiplication, which is denoted by the symbol ⊙ in equation 24 below:

$\begin{matrix}{{RM}[n,:] = \frac{\overbrace{\left| X_{1}[n,:] \right| + \left| X_{2}[n,:] \right|}^{\text{Mixture estimate } \hat{M}} - \overbrace{\Gamma[:]}^{\text{PDNV}} \odot \overbrace{\left| X_{1}[n,:] - X_{2}[n,:] \right|}^{\text{Noise estimate } \hat{N}}}{\underbrace{\left| X_{1}[n,:] \right| + \left| X_{2}[n,:] \right|}_{\text{Mixture estimate } \hat{M}}}} & (24)\end{matrix}$

Those skilled in the art of audio signal processing will understand that the STFTs in equations 23 and 24 can be computed on a frame-by-frame basis using vectors (indicated by “:” in equation 24) of frequency channel (k) values for every analysis frame (n). To summarize, the pairwise noise estimate (N̂) used to compute a pairwise ratio mask (RM) is scaled by a pre-computed frequency-dependent Phase Difference Normalization Vector (PDNV) Γ[k], which normalizes the noise estimate (N̂), at each discrete frequency (k), in a manner dependent on the value of the maximum possible phase difference, at each discrete frequency (k), for a given microphone pair spacing.

A Phase Difference Normalization Vector (PDNV) Γ[k] can be computed for a given microphone spacing. Assuming a distant sound source, the Time Difference of Arrival (TDOA) for a sensor pair is computed as follows, where d is the distance in meters between the two microphones, λ is the speed of sound in m/s and θ is the DOA angle in radians:

$\begin{matrix}{\tau = {\frac{d}{\lambda}{\sin (\theta)}}} & (25)\end{matrix}$

The corresponding wrapped absolute phase difference (ρ), as a function of frequency (f) in Hz, and as plotted in the top row of FIG. 5, can be computed as follows:

ρ(f)=|∠e ^(j2πfτ)|  (26)

where ∠ indicates the phase angle wrapped to the interval [−π, π]. Likewise, the discrete-frequency wrapped absolute phase difference (ρ[k]), as a function of discrete frequency (ω_(k)), for a microphone pair spacing d, and a DOA angle θ in radians, can be computed as follows:

$\begin{matrix}{\rho[k] = \left| \angle e^{j 2\pi \omega_{k} \frac{d}{\lambda} \sin(\theta)} \right|} & (27)\end{matrix}$

A discrete-frequency Phase Difference Normalization Vector (PDNV) Γ[k] can be pre-computed, for a given microphone pair spacing (d), for a given maximum possible angular separation (θ_(max)) in radians, and for a scaling parameter β (for now, β=1), as being equivalent to the inverse of the discrete-frequency wrapped absolute phase difference below a given Frequency cutoff (F_(c)):

$\begin{matrix}{\Gamma[k] = \left\{ \begin{matrix}{\frac{1}{\rho[k]} = \left( \left| \angle e^{j 2\pi \omega_{k} \beta \frac{d}{\lambda} \sin(\theta_{max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq F_{c} \text{ Hz}} \\{1,} & {\text{if } \omega_{k} > F_{c} \text{ Hz}}\end{matrix} \right.} & (28)\end{matrix}$

Below the pre-determined frequency cutoff F_(c), Γ[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference ρ[k] (see equation 27) at the maximum possible angular separation of θ_(max). The pre-computed frequency-dependent PDNV Γ[k] is used to scale (i.e., normalize) the Noise (N̂) term in a manner dependent on the value of the maximum possible phase difference, at each discrete frequency (k), for a given microphone pair spacing.
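A minimal sketch of pre-computing a PDNV per equation (28) and applying it per equation (23) follows. The function names, the speed of sound of 343 m/s, the `eps` guard, and the decision to leave Γ at 1 in any bin where the wrapped phase difference is exactly zero (such as DC) are all assumptions for illustration; the specification does not state how such bins are treated, nor whether the resulting mask is clipped.

```python
import numpy as np

def pdnv(freqs_hz, d, f_cutoff, theta_max=np.pi / 2, beta=1.0,
         speed_of_sound=343.0):
    """Pre-compute Gamma[k] for one microphone pair spacing d (meters)."""
    tau = beta * d / speed_of_sound * np.sin(theta_max)
    rho = np.abs(np.angle(np.exp(1j * 2 * np.pi * freqs_hz * tau)))
    gamma = np.ones_like(freqs_hz, dtype=float)   # unity above the cutoff
    below = (freqs_hz <= f_cutoff) & (rho > 0)    # skip zero-phase bins (assumed)
    gamma[below] = 1.0 / rho[below]
    return gamma

def ratio_mask_pdnv(X_a, X_b, gamma, eps=1e-12):
    """Ratio mask with the Noise estimate scaled element-wise by gamma."""
    mixture = np.abs(X_a) + np.abs(X_b)
    noise = gamma * np.abs(X_a - X_b)
    return (mixture - noise) / (mixture + eps)

# Usage: 140 mm pair, 1000 Hz cutoff, Fs = 50 kHz, F = 1024 (positive bins).
freqs = np.arange(513) * 50000 / 1024
gamma_12 = pdnv(freqs, d=0.140, f_cutoff=1000.0)
```

Since Γ depends only on the geometry and the frequency grid, it can be computed once at start-up and reused for every analysis frame.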

Alternative STTC Processing [52] with Phase Difference Normalization

Pairwise ratio masks RM, one for each microphone spacing (140, 80 and 40 mm) can also be calculated as follows; i.e., there is a unique RM for each pair of microphones ([1,2], [3,4], [5,6]):

$\begin{matrix}{{RM}_{1,2}[n,k] = \frac{\left| X_{1}[n,k] \right| + \left| X_{2}[n,k] \right| - \Gamma_{[1,2]}[k]\left| X_{1}[n,k] - X_{2}[n,k] \right|}{\left| X_{1}[n,k] \right| + \left| X_{2}[n,k] \right|}} & (29a) \\{{RM}_{3,4}[n,k] = \frac{\left| X_{3}[n,k] \right| + \left| X_{4}[n,k] \right| - \Gamma_{[3,4]}[k]\left| X_{3}[n,k] - X_{4}[n,k] \right|}{\left| X_{3}[n,k] \right| + \left| X_{4}[n,k] \right|}} & (29b) \\{{RM}_{5,6}[n,k] = \frac{\left| X_{5}[n,k] \right| + \left| X_{6}[n,k] \right| - \Gamma_{[5,6]}[k]\left| X_{5}[n,k] - X_{6}[n,k] \right|}{\left| X_{5}[n,k] \right| + \left| X_{6}[n,k] \right|}} & (29c)\end{matrix}$

A pairwise Phase Difference Normalization Vector (PDNV) Γ[k], which scales the respective pairwise Noise (N̂) estimate, can be pre-computed for each microphone pair spacing:

$\begin{matrix}{\Gamma_{[1,2]}[k] = \left\{ \begin{matrix}{\frac{1}{\rho_{[1,2]}[k]} = \left( \left| \angle e^{j 2\pi \omega_{k} \beta \frac{d_{1,2}}{\lambda} \sin(\theta_{max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq 1000 \text{ Hz}} \\{1,} & {\text{if } \omega_{k} > 1000 \text{ Hz}}\end{matrix} \right.} & (30a) \\{\Gamma_{[3,4]}[k] = \left\{ \begin{matrix}{\frac{1}{\rho_{[3,4]}[k]} = \left( \left| \angle e^{j 2\pi \omega_{k} \beta \frac{d_{3,4}}{\lambda} \sin(\theta_{max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq 2000 \text{ Hz}} \\{1,} & {\text{if } \omega_{k} > 2000 \text{ Hz}}\end{matrix} \right.} & (30b) \\{\Gamma_{[5,6]}[k] = \left\{ \begin{matrix}{\frac{1}{\rho_{[5,6]}[k]} = \left( \left| \angle e^{j 2\pi \omega_{k} \beta \frac{d_{5,6}}{\lambda} \sin(\theta_{max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq 4000 \text{ Hz}} \\{1,} & {\text{if } \omega_{k} > 4000 \text{ Hz}}\end{matrix} \right.} & (30c)\end{matrix}$

Below a pre-determined frequency cutoff, Γ[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference ρ[k] (see equation 27) at a maximum possible angular separation of θ_(max) = π/2 radians. Although the PDNV Γ[k] can be equivalent to the inverse of ρ[k] across all discrete frequencies ω_(k), here Γ[k] is set to unity at and above a pre-determined frequency cutoff (see equation 30). This alternative processing, for the STTC ALD “listening glasses” shown in FIG. 4, is illustrated in the block diagram in FIG. 11 (compare FIGS. 6 and 11).

Alternative Embodiments of the STTC Assistive Listening Device (ALD).

Further variations, with varied placement of the microphones used to compute the pairwise ratio masks, are described below and shown in FIGS. 12 through 16. As with the first example embodiment of the STTC Assistive Listening Device (ALD) described above, the pairwise Ratio Masks (RM) are computed using pairs of microphones, with varied spacings, that are integrated into the frame of a pair of eyeglasses.

FIG. 12 shows a second example physical realization of an assistive listening device or ALD, specifically as a set of microphones and loudspeakers incorporated in an eyeglass frame [40] worn by a user. In this realization, the microphones [10] are realized using four forward-facing microphones [42] and two near-ear microphones [44-R], [44-L]. The forward-facing microphones [42] are enumerated 1-4 as shown, and functionally arranged into pairs 1-2 and 3-4, with distinct intra-pair spacings of 120 mm and 50 mm, respectively, in this embodiment. The near-ear microphones [44] are included in respective right and left earbuds [46-R], [46-L] along with corresponding in-ear loudspeakers [48-R], [48-L].

Generally, the inputs from the four eyeglass-integrated microphones [42] are used to compute a Time-Frequency (T-F) mask (i.e., a time-varying filter), which is used to attenuate non-target sound sources in the Left and Right near-ear microphones [44-L], [44-R]. The device boosts speech intelligibility for a target talker [13-T] from a designated look direction while preserving binaural cues that are important for spatial hearing.

FIG. 13 illustrates one aspect of the disclosed technique, namely addressing the problem of “null phase differences” that impair performance within certain frequencies for any one pair of microphones. The top panel illustrates the phase separations for two microphone spacings across the frequency range of 0 to 8 kHz, and for three different interfering sound source directions (30°, 60° and 90°). For each microphone pair with respective intra-pair microphone spacing, there are frequencies at which there is little to no phase difference, such that target cancellation based on phase differences cannot be effectively implemented. The disclosed technique employs multiple microphone pairs, with varied spacings, to address this issue.

In the illustrated example shown in FIG. 13, two microphone pairs having respective distinct spacings (e.g., 120 and 50 mm) are used, and their outputs are combined via “piecewise construction”, as illustrated in the bottom panel of FIG. 13; i.e., combined in a manner that provides positive absolute phase differences for the STTC processing to work with in the 0-8 kHz band that is most important for speech intelligibility. In particular, this plot illustrates the “piecewise construction” approach to creating a chimeric Global Ratio Mask RM_(G) from the individual Ratio Masks for the two microphone pairs ([1, 2], [3, 4]).
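As a non-limiting sketch of the null phase difference issue, the wrapped absolute phase difference of a pair can be evaluated numerically, and the lowest non-zero frequency at which it returns to zero follows from the spacing. The helper names and the speed of sound of 343 m/s are assumptions of this sketch; β is taken as 1.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def wrapped_abs_phase(freqs_hz, d_m, theta_rad, c=C):
    """Wrapped absolute inter-microphone phase difference in radians."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    return np.abs(np.angle(np.exp(
        1j * 2 * np.pi * freqs_hz * d_m / c * np.sin(theta_rad))))

def first_null_hz(d_m, theta_rad, c=C):
    """Lowest non-zero frequency at which the phase difference wraps back to zero."""
    return c / (d_m * np.sin(theta_rad))

# Interferer at 90 degrees: the 120 mm pair has a null where the 50 mm pair does not.
f_null = first_null_hz(0.120, np.pi / 2)
phase_120 = wrapped_abs_phase([f_null], 0.120, np.pi / 2)[0]
phase_50 = wrapped_abs_phase([f_null], 0.050, np.pi / 2)[0]
```

Under these assumptions the 120 mm pair has its first null near 2.86 kHz for a 90° interferer, while the 50 mm pair still provides a phase difference of roughly 2.6 radians at that frequency.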

FIG. 14 is a block diagram of the alternative STTC processing used in this second example embodiment of an ALD. Overall, it includes the following distinct stages of calculations:

1. Short-Time Fourier Transform (STFT) processing [50], converts each microphone signal into frequency domain signal
2. Ratio Mask (RM) processing [52], applied to frequency domain signals of microphone pairs
3. Piecewise Construction of a Global Ratio Mask (RM_(G)) [54] processing, uses ratio masks of all microphone pairs
4. Output signal processing [56], uses the Global Ratio Mask (RM_(G)), or a post-processed variant thereof, to scale/modify selected microphone signals to serve as output signal(s) [16]

In this second example embodiment of the STTC ALD, alternative STTC processing, post-processing and time-domain signal reconstruction are illustrated in FIG. 14. Each of the two microphone pairs ([1,2], [3,4]) yields a Ratio Mask (RM_(1,2) and RM_(3,4)). The chimeric Global Ratio Mask RM_(G) has the 0 to 2 kHz (i.e., “low to mid”) frequency channels from RM_(1,2) and the 2 kHz to F/2 (i.e., “mid to high”) frequency channels from RM_(3,4). RM_(G) is smoothed along the frequency axis to yield the Smoothed Ratio Mask RM_(S). Hence, RM_(S) is a post-processed variant of RM_(G). Either RM_(G) or RM_(S) (i.e., the smoothing along the frequency axis step is optional) can be used to attenuate the interfering (i.e., non-target) talkers in the binaural sound mixtures (x_(L) and x_(R)) from microphones in the Left and Right ears, thereby effecting real-time sound source segregation (and a predicted enhancement of target talker speech intelligibility) while still preserving binaural cues for spatial hearing.

The alternative STTC processing (FIG. 14) for this second example embodiment of an STTC ALD uses two pairs of microphones with varied spacing (120 and 50 mm); each of these two microphone spacings is free from null phase differences within a different range of frequencies (FIG. 13). A “piecewise construction” approach to avoiding null phase differences is illustrated in the bottom row of FIG. 13. The approach described herein uses two pairs ([1,2] and [3, 4]) to compute two Ratio Masks (RM_([1,2]) and RM_([3,4])):

$\begin{matrix}{{R{M_{1,2}\left\lbrack {n,\ k} \right\rbrack}} = \frac{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\;^{\lbrack{1,2}\rbrack}}\lbrack k\rbrack}{{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {31a} \right) \\{{R{M_{3,4}\left\lbrack {n,\ k} \right\rbrack}} = \frac{{{X_{3}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\;^{\lbrack{3,4}\rbrack}}\lbrack k\rbrack}{{{X_{3}\left\lbrack {n,k} \right\rbrack} - {X_{4}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{3}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}}}} & \left( {31b} \right)\end{matrix}$

A pairwise Phase Difference Normalization Vector (PDNV) Γ[k], which scales the respective Noise (N̂) terms, can be pre-computed for each microphone pair spacing:

$\begin{matrix}
{\Gamma_{[1,2]}[k] = \begin{cases} \dfrac{1}{P_{[1,2]}[k]} = \left( \left| \angle e^{j 2\pi w_{k} \beta \frac{d_{1,2}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1}, & \text{if } \omega_{k} \leq 1250\ \text{Hz} \\ 1, & \text{if } \omega_{k} > 1250\ \text{Hz} \end{cases}} & (32a) \\
{\Gamma_{[3,4]}[k] = \begin{cases} \dfrac{1}{P_{[3,4]}[k]} = \left( \left| \angle e^{j 2\pi w_{k} \beta \frac{d_{3,4}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1}, & \text{if } \omega_{k} \leq 3250\ \text{Hz} \\ 1, & \text{if } \omega_{k} > 3250\ \text{Hz} \end{cases}} & (32b)
\end{matrix}$

Below a pre-determined frequency cutoff, the pairwise Γ[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference P[k] (see equation 27) at the maximum possible angular separation of θ_(max)=π/2 radians. The frequency-dependent PDNV Γ[k] is used to scale (or normalize) the Noise (N̂) term according to how little phase difference is available at each discrete frequency w_(k). This helps alleviate the problem of having very little phase difference for the STTC processing to work with at relatively low frequencies. Although the PDNV Γ[k] can be equivalent to the inverse of P[k] across all discrete frequencies w_(k), here Γ[k] is set to unity above a pre-determined frequency (see equation 32).

The two eyeglass-integrated microphone pairs ([1, 2], [3, 4]) yield two unique ratio masks (RM_(1,2), RM_(3,4)), which are interfaced with each other so as to provide a positive absolute phase difference for STTC processing to work with (see bottom row of FIG. 13). The “piecewise construction” approach to creating a chimeric Global Ratio Mask RM_(G) from the individual Ratio Masks for the two microphone pairs ([1,2], [3,4]) is illustrated in FIGS. 13 and 14. RM_(G) can be constructed, in a piecewise manner, as follows when using a sampling rate of F_(s)=32 kHz and short-time analysis windows of 4 ms duration:

RM_(G)[n, 1 : 8] = RM_(1, 2)[n, 1 : 8]  ( ≈ 0− > 2000  Hz)${{RM}_{G}\left\lbrack {n,{9:\frac{F}{2}}} \right\rbrack} = {{{RM}_{3,4}\left\lbrack {n,{9:\frac{F}{2}}} \right\rbrack}\mspace{14mu} \left( {{\approx 2000}->{\frac{F_{S}}{2}\mspace{14mu} {Hz}}} \right)}$RM_(G)[n, k] = RM_(G)[n, k]⁺

The superscript plus sign (i.e., RM_(G)[n, k]⁺) indicates that any negative T-F values in RM_(G) are set to zero. The piecewise-constructed Global Ratio Mask RM_(G) is also given conjugate symmetry (i.e., negative frequencies are the mirror image of positive frequencies). This ensures that the processing yields a real (rather than complex) output.
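A minimal sketch of this piecewise construction, under stated assumptions, is given below. It assumes one-sided masks of F/2+1 bins with F=128 (a 4 ms window at 32 kHz), zero-based indexing so that the one-based bins 1:8 become 0 to 7, and that the Nyquist bin, which the ranges above do not cover, is taken from RM_(3,4). The function name is hypothetical.

```python
import numpy as np

def global_mask_two_pairs(rm12, rm34, split_bin=8):
    """Build a full-length, mirrored Global Ratio Mask from two one-sided masks.

    rm12, rm34: arrays of shape (frames, F/2 + 1).
    """
    bins = np.arange(rm12.shape[1])
    half = np.where(bins < split_bin, rm12, rm34)   # low bins from [1,2], rest from [3,4]
    half = np.maximum(half, 0.0)                    # set negative T-F values to zero
    # Mirror bins F/2-1 ... 1 onto the negative frequencies (conjugate symmetry).
    return np.concatenate([half, half[:, -2:0:-1]], axis=1)

rm12 = np.full((2, 65), 0.5)
rm34 = np.full((2, 65), -0.2)
rm_g = global_mask_two_pairs(rm12, rm34)

# Applying the mirrored mask to the spectrum of a real frame keeps the output real.
rng = np.random.default_rng(0)
frame = rng.standard_normal(128)
filtered = np.fft.ifft(rm_g[0] * np.fft.fft(frame))
```

The final two lines check the stated purpose of the mirroring: the inverse transform of the masked spectrum has a negligible imaginary part.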

Because of the fundamental tradeoff between spectral and temporal resolution, when using a relatively short analysis window, the resolution along the discrete-frequency axis can be rather coarse, which can result in subpar and unpleasant speech quality. However, the speech quality can be improved by “Channel Weighting”, which consists of smoothing along the frequency axis. This “frequency smoothing” can be effected in various ways, for example through use of a mean filter or convolution with a gammatone weighting function. When using relatively long analysis windows, this post-processing step is not necessary or useful. However, when using relatively short analysis windows, this “Channel Weighting” (i.e., smoothing along the frequency axis) post-processing step can noticeably improve speech quality. As illustrated in FIG. 14, RM_(G) is smoothed along the frequency axis to yield the Smoothed Ratio Mask RM_(S). Hence, RM_(S) is a post-processed variant of RM_(G). Either RM_(G) or RM_(S) (i.e., the channel weighting step is optional) can be used to attenuate the interfering (i.e., non-target) talkers in the binaural sound mixtures (x_(L) and x_(R)) from microphones in the Left and Right ears.
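As one hedged example of such frequency smoothing, a mean filter along the frequency axis may be sketched as follows. The three-bin width, the edge padding and the function name are assumptions of this sketch; a gammatone weighting function could be substituted for the uniform kernel.

```python
import numpy as np

def smooth_along_frequency(rm, width=3):
    """Mean-filter a mask of shape (frames, bins) along the frequency axis."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the output keeps the input length")
    kernel = np.ones(width) / width
    half = width // 2
    # Repeat the edge bins so the first and last channels are averaged too.
    padded = np.pad(rm, ((0, 0), (half, half)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])

impulse = np.array([[0.0, 0.0, 1.0, 0.0, 0.0]])
smoothed = smooth_along_frequency(impulse)
```

In this sketch the smoothing would be applied to the one-sided mask before it is mirrored, so that conjugate symmetry is not disturbed at the band edges.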

The output of the STTC processing is an estimate of the target speech signal from the specified look direction. The Left and Right (i.e., stereo pair) Time-Frequency domain estimates (Ŝ_(L)[n, k] and Ŝ_(R)[n, k]) of the target speech signal can be described as follows, where X_(L) and X_(R) are the Short Time Fourier Transforms (STFTs) of the signals x_(L) and x_(R), from the designated Left and Right microphones, and RM_(S)[n, k] is the conjugate-symmetric Smoothed Ratio Mask (i.e., the set of short-time weights for all frequencies, both positive and negative):

Ŝ_(L)[n,k]=RM_(S)[n,k]×X_(L)[n,k]

Ŝ_(R)[n,k]=RM_(S)[n,k]×X_(R)[n,k]  (33)

Those skilled in the art of audio signal processing will understand that RM_(G), or any post-processed variant thereof, can be used to compute the output of STTC processing:

Ŝ_(L)[n,k]=RM_(G)[n,k]×X_(L)[n,k]

Ŝ_(R)[n,k]=RM_(G)[n,k]×X_(R)[n,k]  (34)

A user-defined “mix” parameter α would allow the user of an STTC “Assistive Listening Device” to determine the ratio of processed and unprocessed output. With α=0, only unprocessed output would be heard, whereas with α=1 only processed output (i.e., the output of the STTC processing described herein) would be heard. At intermediate values, the user could set a preferred mix of processed and unprocessed output, either beforehand or online using a smartphone application. The frequency-domain stereo output ([Y_(L), Y_(R)]) would thus be some user-defined mixture of processed ([Ŝ_(L), Ŝ_(R)]) and unprocessed ([X_(L), X_(R)]) audio:

Y_(L)[n,k]=αŜ_(L)[n,k]+(1−α)X_(L)[n,k]

Y_(R)[n,k]=αŜ_(R)[n,k]+(1−α)X_(R)[n,k]  (35)
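The mixing step can be illustrated with the following sketch, which applies a ratio mask to one channel and blends the result with the unprocessed spectrum. The function name `mix_output` and the example values are hypothetical; the same call would be made once for the Left channel and once for the Right channel.

```python
import numpy as np

def mix_output(rm, X, alpha):
    """Blend masked (processed) and unprocessed STFTs with mix parameter alpha in [0, 1]."""
    processed = rm * X                      # masked target estimate for this channel
    return alpha * processed + (1.0 - alpha) * X

X_left = np.array([[1.0 + 1.0j, 2.0 + 0.0j]])
rm = np.array([[0.5, 0.0]])
y_dry = mix_output(rm, X_left, alpha=0.0)   # unprocessed only
y_wet = mix_output(rm, X_left, alpha=1.0)   # processed only
y_half = mix_output(rm, X_left, alpha=0.5)  # equal blend
```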

Synthesis of a stereo output (y_(L)[m] and y_(R)[m]) estimate of the target speech signal consists of taking the Inverse Short Time Fourier Transforms (ISTFTs) of Y_(L)[n, k] and Y_(R)[n, k] and using the overlap-add method of reconstruction. Alternative processing would involve using RM_(S) as a postfilter for a fixed and/or adaptive beamformer, and giving the user control over the combination of STTC processing, beamforming, and unprocessed audio.
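A compact sketch of STFT analysis and overlap-add synthesis is given below for one channel. The Hann window, the 50% overlap and the window-squared normalization are assumptions chosen for this example and are not specified in the text; the 128-sample window corresponds to 4 ms at 32 kHz.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed FFT of successive frames; returns shape (frames, len(win))."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.fft.fft(np.array(frames), axis=1)

def istft(Y, win, hop):
    """Overlap-add reconstruction with window-squared normalization."""
    n = len(win)
    out = np.zeros(hop * (Y.shape[0] - 1) + n)
    norm = np.zeros_like(out)
    for i, frame in enumerate(np.fft.ifft(Y, axis=1).real):
        out[i * hop:i * hop + n] += frame * win
        norm[i * hop:i * hop + n] += win ** 2
    return out / np.maximum(norm, 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
win = np.hanning(128)
y = istft(stft(x, win, hop=64), win, hop=64)   # no mask applied: should return x
```

With no mask applied, the interior of the signal is recovered exactly; only the first and last partial window, where the overlapping windows do not sum properly, differs.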

FIG. 15 shows a third example physical realization of an assistive listening device or ALD, specifically as a set of microphones and loudspeakers incorporated in an eyeglass frame [40] worn by a user. In this realization, the microphones [10] are realized using four eyeglass-integrated microphones [42], arranged on the left temple piece (i.e., stem) of the eyeglass frames, and two near-ear microphones [44-R], [44-L]. The four eyeglass-integrated microphones [42] are enumerated 1-4 as shown, and functionally arranged into three pairs 1-2, 1-3 and 1-4, with distinct intra-pair spacings of 21.5 mm, 43 mm and 64.5 mm, respectively, in this embodiment. The near-ear microphones [44] are included in respective right and left earbuds [46-R], [46-L] along with corresponding in-ear loudspeakers [48-R], [48-L].

Generally, the inputs from the four eyeglass-integrated microphones [42] are used to compute a Time-Frequency (T-F) mask (i.e., a time-varying filter), which is used to attenuate non-target sound sources in the Left and Right near-ear microphones [44-L], [44-R]. The device boosts speech intelligibility for a target talker [13-T] from a designated look direction while preserving binaural cues that are important for spatial hearing.

FIG. 16 is a block diagram of the alternative STTC processing used in this third example embodiment of an ALD. Overall, it includes the following distinct stages of calculations:

1. Short-Time Fourier Transform (STFT) processing [50], converts each microphone signal into frequency domain signal
2. Ratio Mask (RM) processing [52], applied to frequency domain signals of microphone pairs
3. Piecewise Construction of a Global Ratio Mask (RM_(G)) [54] processing, uses ratio masks of all microphone pairs
4. Output signal processing, uses the Global Ratio Mask (RM_(G)), or a post-processed variant thereof, to scale/modify selected microphone signals to serve as output signal(s) (as in FIG. 14)

In this third example embodiment (see FIGS. 15 and 16), τ sample shifts, as described in ¶0051 herein and in the original specification, are used to steer the “look” direction of the eyeglass-integrated microphones by 90° (equivalent to θ=π/2 radians); i.e., so as to steer the “look” direction towards a target directly in front of the ALD user.

The τ sample shifts are computed independently for each pair of microphones, where F_(s) is the sampling rate, d is the inter-microphone spacing in meters, λ is the speed of sound in meters per second and θ is the specified angular “look” direction in radians (here θ=π/2):

$\begin{matrix}
{\tau_{[1,2]} = \left\lfloor F_{s} \times \frac{d_{[1,2]}}{\lambda} \sin(\theta) \right\rceil} & (36a) \\
{\tau_{[1,3]} = \left\lfloor F_{s} \times \frac{d_{[1,3]}}{\lambda} \sin(\theta) \right\rceil} & (36b) \\
{\tau_{[1,4]} = \left\lfloor F_{s} \times \frac{d_{[1,4]}}{\lambda} \sin(\theta) \right\rceil} & (36c)
\end{matrix}$
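As a worked illustration, the sample shifts for the three spacings of this embodiment can be computed as follows, assuming a sampling rate of 32 kHz and a speed of sound of 343 m/s. Both helper names are hypothetical, and the zero-padded delay is only one possible way of applying the shift.

```python
import math
import numpy as np

def tau_samples(fs_hz, d_m, theta_rad, c=343.0):
    """Nearest-integer sample shift for one microphone pair."""
    return int(round(fs_hz * d_m / c * math.sin(theta_rad)))

def delay(x, tau):
    """Delay x by tau samples, zero-padding the start and keeping the length."""
    return np.concatenate([np.zeros(tau), x[:len(x) - tau]])

spacings_m = [0.0215, 0.043, 0.0645]            # pairs [1,2], [1,3], [1,4]
taus = [tau_samples(32000, d, math.pi / 2) for d in spacings_m]
shifted = delay(np.arange(1.0, 6.0), taus[0])
```

Under these assumptions the three spacings correspond to shifts of 2, 4 and 6 samples.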

Because here we are shifting the “look” direction by 90° (i.e., θ=π/2) via these pairwise τ sample shifts, it is in this case necessary to modify the computation of the discrete-frequency wrapped absolute phase difference (P[k]) so as to incorporate a scaling parameter β; here β=2.

A modified discrete-frequency wrapped absolute phase difference (P[k]), as a function of discrete frequency (w_(k)) in Hz and DOA angle θ in radians, and here with a scaling parameter of β=2, can be computed as follows, where d is the microphone pair spacing in meters:

$\begin{matrix}{P[k] = \left| \angle e^{j 2\pi w_{k} \beta \frac{d}{\lambda} \sin(\theta)} \right|} & (37)\end{matrix}$

A pairwise discrete-frequency Phase Difference Normalization Vector (PDNV) Γ[k] can be pre-computed, for a given microphone pair spacing (d), and for a given maximum possible angular separation (θ_(max)) in radians, as being equivalent to the inverse of the discrete-frequency wrapped absolute phase difference below a given frequency cutoff (F_(c)):

$\begin{matrix}{{\Gamma \lbrack k\rbrack} = \left\{ \begin{matrix}{{\frac{1}{P\lbrack k\rbrack} = \left( {{\angle \; e^{j2\pi w_{k}\beta \frac{d}{\lambda}si{n{(\theta_{\max})}}}}} \right)^{- 1}},} & {{{if}\mspace{14mu} \omega_{k}} \leq {F_{c}{Hz}}} \\{1,} & {{{if}\mspace{14mu} \omega_{k}} > {F_{c}{Hz}}}\end{matrix} \right.} & (38)\end{matrix}$

Below the pre-determined frequency cutoff F_(c), Γ[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference P[k] (see equation 37) at the maximum possible angular separation of θ_(max). The pre-computed frequency-dependent PDNV Γ[k] is used to scale (i.e., normalize) the Noise (N̂) term in a manner dependent on the value of the maximum possible phase difference, at each discrete frequency (k), for a given microphone pair spacing.

As illustrated on the left hand side of FIG. 16, the τ sample shifts are used to delay x₁[m] before computing three different variants of X₁[n, k]; although the same X₁[n, k] notation is used for all three Ratio Mask (RM) computations, X₁[n, k] is in this case a local variable, computed uniquely for each of the three RM computations, because x₁[m] is shifted by three different τ sample shifts (τ_([1,2]), τ_([1,3]), τ_([1,4])) before the STFT stage that yields X₁[n, k].

In this third example embodiment of the STTC ALD, alternative STTC processing is illustrated in FIG. 16. Each of the three microphone pairs ([1,2], [1,3], [1,4]) yields a Ratio Mask (RM_(1,2), RM_(1,3) and RM_(1,4)). Here the chimeric Global Ratio Mask RM_(G) has the 0 to 1.5 kHz (i.e., “low to mid”) frequency channels from RM_(1,4), the 1.5 kHz to 3 kHz (i.e., “mid”) frequency channels from RM_(1,3) and the 3 kHz to F/2 (i.e., “mid to high”) frequency channels from RM_(1,2).

The alternative STTC processing (FIG. 16) for this third example embodiment of an STTC ALD uses three pairs of microphones with varied spacing (21.5, 43 and 64.5 mm); each of these three microphone spacings is free from null phase differences within a different range of frequencies. The approach described herein uses three microphone pairs ([1, 2], [1,3] and [1, 4]) to compute three Ratio Masks (RM_([1,2]), RM_([1,3]) and RM_([1,4])).

$\begin{matrix}{{R{M_{1,2}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{1,2}\rbrack}\lbrack k\rbrack}{{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {39a} \right) \\{{R{M_{1,3}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{3}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{1,3}\rbrack}\lbrack k\rbrack}{{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{3}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{3}\left\lbrack {n,k} \right\rbrack}}}} & \left( {39b} \right) \\{{R{M_{1,4}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{1,4}\rbrack}\lbrack k\rbrack}{{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{4}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}}}} & \left( {39c} \right)\end{matrix}$

A pairwise Phase Difference Normalization Vector (PDNV) Γ[k], which scales the respective pairwise Noise (N̂) estimate, can be pre-computed for each microphone pair spacing, using the modified PDNV computation in ¶0088 that incorporates a parameter β (here β=2):

$\begin{matrix}
{\Gamma_{[1,2]}[k] = \begin{cases} \dfrac{1}{P_{[1,2]}[k]} = \left( \left| \angle e^{j 2\pi w_{k} \beta \frac{d_{1,2}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1}, & \text{if } \omega_{k} \leq 1000\ \text{Hz} \\ 1, & \text{if } \omega_{k} > 1000\ \text{Hz} \end{cases}} & (40a) \\
{\Gamma_{[1,3]}[k] = \begin{cases} \dfrac{1}{P_{[1,3]}[k]} = \left( \left| \angle e^{j 2\pi w_{k} \beta \frac{d_{1,3}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1}, & \text{if } \omega_{k} \leq 2000\ \text{Hz} \\ 1, & \text{if } \omega_{k} > 2000\ \text{Hz} \end{cases}} & (40b) \\
{\Gamma_{[1,4]}[k] = \begin{cases} \dfrac{1}{P_{[1,4]}[k]} = \left( \left| \angle e^{j 2\pi w_{k} \beta \frac{d_{1,4}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1}, & \text{if } \omega_{k} \leq 4000\ \text{Hz} \\ 1, & \text{if } \omega_{k} > 4000\ \text{Hz} \end{cases}} & (40c)
\end{matrix}$

Below a pre-determined frequency cutoff, Γ[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference P[k] (see equation 37) at the maximum possible angular separation of θ_(max)=π/2 radians. The frequency-dependent PDNV Γ[k] is used to scale (or normalize) the Noise (N̂) term according to how little phase difference is available at each discrete frequency w_(k). This helps alleviate the problem of having very little phase difference for the STTC processing to work with at relatively low frequencies. Although the PDNV Γ[k] can be equivalent to the inverse of P[k] across all discrete frequencies w_(k), here Γ[k] is set to unity above a pre-determined frequency (see equation 40).

The three eyeglass-integrated microphone pairs ([1, 2], [1, 3], [1, 4]) yield pairwise ratio masks (RM_(1,2), RM_(1,3), RM_(1,4)), which are interfaced with each other to construct the chimeric Global Ratio Mask (RM_(G)), which can be constructed via “Piecewise Construction” as follows when using a sampling rate of F_(s)=32 kHz and short-time analysis windows of 4 ms duration:

RM_(G)[n, 1:6] = RM_(1, 4)[n, 1:6]  ( ≈ 0 → 1500  Hz)RM_(G)[n, 7:12] = RM_(1, 3)[n, 7:2]  ( ≈ 1500 → 3000  Hz)${{RM}_{G}\left\lbrack {n,{13\text{:}\frac{F}{2}}} \right\rbrack} = {{{RM}_{1,2}\left\lbrack {n,{13\text{:}\frac{F}{2}}} \right\rbrack}\mspace{14mu} \left( {\approx 3000}\rightarrow{\frac{F_{S}}{2}\mspace{14mu} {Hz}} \right)}$

RM_(G)[n, k]=RM_(G)[n, k]⁺. The superscript plus sign (i.e., RM_(G)[n, k]⁺) indicates that any negative T-F values in RM_(G) are set to zero. The piecewise-constructed Global Ratio Mask RM_(G) is also given conjugate symmetry (i.e., negative frequencies are the mirror image of positive frequencies). This ensures that the processing yields a real (rather than complex) output.
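The piecewise construction generalizes to any number of pairs. The following sketch, with a hypothetical helper `piecewise_mask`, takes one-sided masks and the zero-based first bin of each band; for this third embodiment the band starts would be 0, 6 and 12, i.e., the one-based bins 1, 7 and 13 above. Treating the last band as extending through the final one-sided bin is an assumption of the sketch.

```python
import numpy as np

def piecewise_mask(masks, starts):
    """Stitch one-sided masks of shape (frames, bins) into a Global Ratio Mask.

    masks:  pairwise ratio masks, ordered from the lowest band to the highest.
    starts: zero-based first bin of each band, in ascending order.
    """
    n_bins = masks[0].shape[1]
    out = np.empty(masks[0].shape, dtype=float)
    edges = list(starts) + [n_bins]
    for mask, lo, hi in zip(masks, edges[:-1], edges[1:]):
        out[:, lo:hi] = mask[:, lo:hi]
    return np.maximum(out, 0.0)                 # set negative T-F values to zero

rm14 = np.full((1, 65), 0.1)
rm13 = np.full((1, 65), 0.2)
rm12 = np.full((1, 65), 0.3)
rm_g3 = piecewise_mask([rm14, rm13, rm12], starts=[0, 6, 12])
```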

II. System Description of 8-Microphone Short-Time Target Cancellation (STTC) Human-Computer Interface (HCI)

FIGS. 17-21 show a second embodiment of a computerized realization using 8 microphones. The STTC processing serves as a front end to a computer hearing application such as automatic speech recognition (ASR). Because much of the processing is the same as or similar to that of a 6-microphone system as described above, the description of FIGS. 17-21 is limited to highlighting the key differences from corresponding aspects of the 6-microphone system.

FIG. 17 is a block diagram of a specialized computer that realizes the STTC functionality. It includes one or more processors [70], primary memory [72], I/O interface circuitry [74], and secondary storage [76], all interconnected by high-speed interconnect [78] such as one or more high-bandwidth internal buses. The I/O interface circuitry [74] interfaces to external devices including the input microphones, perhaps through integral or non-integral analog-to-digital converters. In operation, the memory [72] stores computer program instructions of application programs as well as an operating system, as generally known. In this case, the application programs include STTC processing [20-2] as well as a machine hearing application (M-H APP) [80]. The remaining description focuses on structure and operation of the STTC processing [20-2], which generates noise-reduced output audio signals [16] (FIG. 1) supplied to the machine hearing application [80].

FIG. 18 shows a physical realization of a computer structured according to FIG. 17, in this case in the form of a laptop computer [90] having an array of eight microphones [92] integrated into an upper part of its casing as shown. The four pairs ([1, 2], [3, 4], [5, 6], [7, 8]) of microphones have distinct spacings of 320, 160, 80 and 40 mm, respectively.

FIG. 19 is a set of plots of phase separations for the 8-microphone array, analogous to that of FIG. 5 for the 6-microphone array. The bottom panel illustrates a piecewise approach to creating the Global Ratio Mask RM_(G) from the individual Ratio Masks for the four microphone pairs ([1, 2], [3, 4], [5, 6], [7, 8]). This is described in additional detail below.

FIG. 20 is a block diagram of the STTC processing [20-2] (FIG. 17), analogous to FIG. 6 described above. It includes the following distinct stages of calculations, similar to the processing of FIG. 6 except for use of four rather than three microphone pairs:

1. Short-Time Fourier Transform (STFT) processing [90], converts each microphone signal into frequency domain signal.
2. Ratio Mask (RM) and Binary Mask (BM) processing [92], applied to frequency domain signals of microphone pairs.
3. Global Ratio Mask (RM_(G)) and Thresholded Ratio Mask (RM_(T)) processing [94], uses ratio masks of all microphone pairs.
4. Output signal processing [96], uses the Thresholded Ratio Mask (RM_(T)) to scale/modify selected microphone signals to serve as output signal(s) [16].

In the STFT processing [90], individual STFT calculations [90] are the same as above. Two additional STFTs are calculated for the 4th microphone pair (7,8). In the RM processing [92], a fourth RM_(7,8) is calculated for the fourth microphone pair:

$\begin{matrix}{{R{M_{1,2}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}} - {{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {41a} \right) \\{{R{M_{3,4}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{3}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}} - {{{X_{3}\left\lbrack {n,k} \right\rbrack} - {X_{4}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{3}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}}}} & \left( {41b} \right) \\{{R{M_{5,6}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{5}\left\lbrack {n,k} \right\rbrack}} + {{X_{6}\left\lbrack {n,k} \right\rbrack}} - {{{X_{5}\left\lbrack {n,k} \right\rbrack} - {X_{6}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{5}\left\lbrack {n,k} \right\rbrack}} + {{X_{6}\left\lbrack {n,k} \right\rbrack}}}} & \left( {41c} \right) \\{{R{M_{7,8}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{7}\left\lbrack {n,k} \right\rbrack}} + {{X_{8}\left\lbrack {n,k} \right\rbrack}} - {{{X_{7}\left\lbrack {n,k} \right\rbrack} - {X_{8}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{7}\left\lbrack {n,k} \right\rbrack}} + {{X_{8}\left\lbrack {n,k} \right\rbrack}}}} & \left( {41d} \right)\end{matrix}$

Also, as shown in the bottom panel of FIG. 19, piecewise construction of the global ratio mask RM_(G) uses the four RMs as follows (using F_(s)=50 kHz and F=1024 for the examples herein):

RM_(G)[n, 1:16] = RM_(1, 2)[n, 1:16]  ( ≈ 0 → 750  Hz)RM_(G)[n, 17:32] = RM_(3, 4)[n, 17:32]  ( ≈ 750 → 1500  Hz)RM_(G)[n, 33:61] = RM_(5, 6)[n, 33:61]  ( ≈ 1500 → 3000  Hz)${{RM}_{G}\left\lbrack {n,{62\text{:}\frac{F}{2}}} \right\rbrack} = {{{RM}_{7,8}\left\lbrack {n,{62\text{:}\frac{F}{2}}} \right\rbrack}\mspace{14mu} \left( {\approx 3000}\rightarrow{\frac{F_{S}}{2}\mspace{14mu} {Hz}} \right)}$

Similarly, the pairwise BM calculations include calculation of a fourth Binary Mask, BM_(7,8), for the fourth microphone pair [7, 8]:

$\begin{matrix}{{B{M_{1,2}\left\lbrack {n,k} \right\rbrack}} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} {{RM}_{1,2}\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\0 & {{{if}\mspace{14mu} {{RM}_{1,2}\left\lbrack {n,k} \right\rbrack}} < \psi}\end{matrix} \right.} & \left( {42a} \right) \\{{B{M_{3,4}\left\lbrack {n,k} \right\rbrack}} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} {{RM}_{3,4}\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\0 & {{{if}\mspace{14mu} {{RM}_{3,4}\left\lbrack {n,k} \right\rbrack}} < \psi}\end{matrix} \right.} & \left( {42b} \right) \\{{B{M_{5,6}\left\lbrack {n,k} \right\rbrack}} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} {{RM}_{5,6}\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\0 & {{{if}\mspace{14mu} {{RM}_{5,6}\left\lbrack {n,k} \right\rbrack}} < \psi}\end{matrix} \right.} & \left( {42c} \right) \\{{B{M_{7,8}\left\lbrack {n,k} \right\rbrack}} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} {{RM}_{7,8}\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\0 & {{{if}\mspace{14mu} {{RM}_{7,8}\left\lbrack {n,k} \right\rbrack}} < \psi}\end{matrix} \right.} & \left( {42d} \right)\end{matrix}$

And the Global Binary Mask BM_(G) uses all four BMs:

BM_(G)[n,k]=BM_(1,2)[n,k]×BM_(3,4)[n,k]×BM_(5,6)[n,k]×BM_(7,8)[n,k]  (43)
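By way of example only, the thresholding and the product of the pairwise Binary Masks may be sketched as follows. The function name is hypothetical, and ψ may be passed either as a scalar or as a per-frequency vector, the latter being one way to represent a ramped threshold.

```python
import numpy as np

def global_binary_mask(ratio_masks, psi):
    """Threshold each pairwise ratio mask at psi and multiply the binary masks."""
    binary_masks = [(rm >= psi).astype(float) for rm in ratio_masks]
    return np.prod(binary_masks, axis=0)

rms = [
    np.array([[0.9, 0.9, 0.1]]),
    np.array([[0.9, 0.1, 0.9]]),
    np.full((1, 3), 0.9),
    np.full((1, 3), 0.9),
]
bm_g = global_binary_mask(rms, psi=0.5)
```

A T-F unit survives in the Global Binary Mask only if every pair's ratio mask meets the threshold, so a single pair falling below ψ is enough to zero that unit.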

FIG. 21 shows the less aggressively ramped threshold used for the BM calculations. For frequencies below 1250 Hz, the threshold ramps linearly.

For the Output Signal Reconstruction [96], both stereo and mono alternatives are possible. These are generally similar to those of FIG. 6, except that the stereo version filters the signals from the third microphone pair (3,4). The mono version combines the outputs of all eight microphone signals:

$\begin{matrix}{{X_{M}\left\lbrack {n,k} \right\rbrack} = \frac{\sum\limits_{i = 1}^{I}{X_{i}\left\lbrack {n,k} \right\rbrack}}{I}} & (44) \\{{Y_{M}\left\lbrack {n,k} \right\rbrack} = {{{RM}_{T}\left\lbrack {n,k} \right\rbrack} \times {X_{M}\left\lbrack {n,k} \right\rbrack}}} & (45)\end{matrix}$

Alternative STTC HCI Processing [52] with Phase Difference Normalization.

Pairwise ratio masks RM, one for each microphone spacing (320, 160, 80 and 40 mm), can also be calculated as follows, using the Phase Difference Normalization Vectors (PDNV) described in ¶0065-0068; there is a unique RM for each pair of microphones ([1,2], [3,4], [5,6], [7,8]):

$\begin{matrix}{{R{M_{1,2}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{1,2}\rbrack}\lbrack k\rbrack}{{{X_{1}\left\lbrack {n,k} \right\rbrack} - {X_{2}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{1}\left\lbrack {n,k} \right\rbrack}} + {{X_{2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {46a} \right) \\{{R{M_{3,4}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{3}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{3,4}\rbrack}\lbrack k\rbrack}{{{X_{3}\left\lbrack {n,k} \right\rbrack} - {X_{4}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{3}\left\lbrack {n,k} \right\rbrack}} + {{X_{4}\left\lbrack {n,k} \right\rbrack}}}} & \left( {46b} \right) \\{{R{M_{5,6}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{5}\left\lbrack {n,k} \right\rbrack}} + {{X_{6}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{5,6}\rbrack}\lbrack k\rbrack}{{{X_{5}\left\lbrack {n,k} \right\rbrack} - {X_{6}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{5}\left\lbrack {n,k} \right\rbrack}} + {{X_{6}\left\lbrack {n,k} \right\rbrack}}}} & \left( {46c} \right) \\{{R{M_{7,8}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{7}\left\lbrack {n,k} \right\rbrack}} + {{X_{8}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{7,8}\rbrack}\lbrack k\rbrack}{{{X_{7}\left\lbrack {n,k} \right\rbrack} - {X_{8}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{7}\left\lbrack {n,k} \right\rbrack}} + {{X_{8}\left\lbrack {n,k} \right\rbrack}}}} & \left( {46d} \right)\end{matrix}$

A pairwise Phase Difference Normalization Vector (PDNV) Γ[k], which scales the respective pairwise Noise (N̂) estimate, can be pre-computed for each microphone pair spacing:

$\begin{matrix}{\Gamma_{[1,2]}[k] = \left\{ \begin{matrix}{\left( \left| \angle e^{j 2\pi \omega_{k} \beta \frac{d_{1,2}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq 500\ \text{Hz}} \\ {1,} & {\text{if } \omega_{k} > 500\ \text{Hz}}\end{matrix} \right.} & (47a) \\ {\Gamma_{[3,4]}[k] = \left\{ \begin{matrix}{\left( \left| \angle e^{j 2\pi \omega_{k} \beta \frac{d_{3,4}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq 1000\ \text{Hz}} \\ {1,} & {\text{if } \omega_{k} > 1000\ \text{Hz}}\end{matrix} \right.} & (47b) \\ {\Gamma_{[5,6]}[k] = \left\{ \begin{matrix}{\left( \left| \angle e^{j 2\pi \omega_{k} \beta \frac{d_{5,6}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq 2000\ \text{Hz}} \\ {1,} & {\text{if } \omega_{k} > 2000\ \text{Hz}}\end{matrix} \right.} & (47c) \\ {\Gamma_{[7,8]}[k] = \left\{ \begin{matrix}{\left( \left| \angle e^{j 2\pi \omega_{k} \beta \frac{d_{7,8}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq 4000\ \text{Hz}} \\ {1,} & {\text{if } \omega_{k} > 4000\ \text{Hz}}\end{matrix} \right.} & (47d)\end{matrix}$

Below a pre-determined frequency cutoff, Γ[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference (see equation 27) at a maximum possible angular separation of θ_(max)=π/2 radians. Although Γ[k] can be equivalent to the inverse of this phase difference across all discrete frequencies ω_(k), here Γ[k] is set to unity at and above a pre-determined frequency cutoff (see equation 47). This alternative processing, for the Human-Computer Interface (HCI) shown in FIG. 18, is illustrated in the block diagram in FIG. 22 (compare FIGS. 20 and 22).
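By way of a non-limiting illustration, equations 46 and 47 can be sketched in Python with NumPy as follows. The function names `pdnv` and `ratio_mask`, the value of 343 m/s used for λ, the default β=1, the small `eps` guard, and the choice to leave Γ[k] at unity in any bin whose phase difference is exactly zero (e.g., the DC bin) are assumptions made for this sketch only and are not taken from the specification.

```python
import numpy as np

def pdnv(freqs_hz, d_m, cutoff_hz, beta=1.0, c=343.0, theta_max=np.pi / 2):
    """Sketch of a Phase Difference Normalization Vector (equation 47)."""
    # Wrapped absolute phase difference at the maximum angular separation.
    phase = np.abs(np.angle(
        np.exp(1j * 2 * np.pi * freqs_hz * beta * (d_m / c) * np.sin(theta_max))))
    gamma = np.ones_like(freqs_hz, dtype=float)
    # Invert only at or below the cutoff. Bins with zero phase difference are
    # left at unity, an assumption made here to avoid division by zero.
    use = (freqs_hz <= cutoff_hz) & (phase > 0)
    gamma[use] = 1.0 / phase[use]
    return gamma

def ratio_mask(Xa, Xb, gamma, eps=1e-12):
    """Sketch of a pairwise ratio mask (equation 46) for STFTs of shape (frames, bins)."""
    mixture = np.abs(Xa) + np.abs(Xb)      # pairwise mixture estimate
    noise = gamma * np.abs(Xa - Xb)        # PDNV-scaled pairwise noise estimate
    return (mixture - noise) / (mixture + eps)

# Example: PDNV for the 320 mm pair with its 500 Hz cutoff.
freqs = np.array([0.0, 250.0, 500.0, 1000.0])
gamma_12 = pdnv(freqs, d_m=0.320, cutoff_hz=500.0)
```

Because Γ[k] can exceed unity at low frequencies, the mask values in this sketch are not clipped to the range 0 to 1; whether and how to clip is left open here.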

Absolute phase differences for the four microphone spacings (320, 160, 80 and 40 mm) and three DOA angles (±30°, ±60°, ±90°) are plotted in the top row of FIG. 19. There is an interaction between frequency, microphone spacing and DOA angle (θ) that yields wrapped [−π, π] absolute phase differences of zero at specific frequencies. Where the phase difference is at or near zero, the target cancellation approach is ineffective, as the interfering sound sources are cancelled at these frequencies and thereby are erroneously included in the frequency-domain signal estimate (Ŝ=M̂−N̂). Multiple microphone pairs are used to overcome this null phase difference problem and thereby improve performance. This is further illustrated in FIG. 23 for a mixture of three concurrent talkers (compare FIGS. 19, 22 and 23).

Example Time-Frequency (T-F) masks for a mixture of three talkers are shown in FIG. 23. The three concurrent talkers were at −60°, 0° and +60°, with all three talkers at equal loudness. The target talker was “straight ahead” at 0° and the two interfering talkers were to the left and right at ±60°. The Ratio Masks from the four microphone pairs ([1,2], [3,4], [5,6] and [7,8]) are shown in the first four panels. For each of these Ratio Masks, there are frequencies at which there is no phase difference between target and interferer, resulting in bands of T-F tiles with (incorrect) values of (or near) “1” (see horizontal white bands in the first three panels). However, multiple T-F masks from the multiple microphone pairs can be interfaced to yield a Global Ratio Mask RM_(G) (bottom Left panel) that is similar in appearance to the Ideal Ratio Mask (IRM) computed using “oracle knowledge” of the signal and noise components in the mixture. RM_(G) is an effective time-varying filter, with a vector of frequency channel weights for every analysis frame.

The processing computes multiple pairwise ratio masks for multiple microphone spacings (e.g., 320, 160, 80 and 40 mm). Each of the four Ratio Masks (RM_(1,2), RM_(3,4), RM_(5,6) and RM_(7,8)) has frequency bands where the T-F tiles are being overestimated (see horizontal white bands with values of “1” in FIG. 23). However, the multiple pairwise ratio masks can be interfaced (FIGS. 19 and 22) to compute a chimeric (i.e., composite) T-F mask which can look similar to the Ideal Ratio Mask (IRM) (see FIG. 23). Only the signals from the microphones (see FIG. 22) were used as input, whereas the IRM, which has a transfer function equivalent to a time-varying Wiener filter, is granted access to the component Signal (S) and Noise (N) terms:

$\begin{matrix}{{{IRM}\left( {t,f} \right)} = \frac{S^{2}\left( {t,f} \right)}{{S^{2}\left( {t,f} \right)} + {N^{2}\left( {t,f} \right)}}} & (48)\end{matrix}$

where S²(t, f) and N²(t, f) are the signal (i.e., target speech) energy and noise energy, respectively; i.e., the Ideal Ratio Mask has “oracle knowledge” of the signal and noise components. The STTC ALD is capable of computing a T-F mask, in real-time, that is similar to the IRM (see FIG. 23), and does so without requiring any information about the noise source(s).
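For comparison purposes, equation 48 can be sketched as follows, assuming the separate signal and noise STFTs are available (as they would be only in a simulation with “oracle knowledge”). The function name `ideal_ratio_mask` and the `eps` guard against empty T-F tiles are assumptions of this sketch.

```python
import numpy as np

def ideal_ratio_mask(S, N, eps=1e-12):
    """Sketch of equation 48: IRM from oracle signal and noise STFTs."""
    s2 = np.abs(S) ** 2   # signal energy per T-F tile
    n2 = np.abs(N) ** 2   # noise energy per T-F tile
    # eps (an assumption of this sketch) avoids 0/0 in silent tiles.
    return s2 / (s2 + n2 + eps)

# Example with one frame and three frequency channels.
S = np.array([[1.0 + 0j, 0.0 + 0j, 2.0 + 0j]])
N = np.array([[1.0 + 0j, 1.0 + 0j, 0.0 + 0j]])
irm = ideal_ratio_mask(S, N)
```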

Alternative Embodiments of STTC Human-Computer Interface (HCI).

Alternative embodiments of an STTC Human-Computer Interface (HCI) could use a variety of microphone array configurations and alternative processing. For example, a “broadside” and/or “endfire” array of microphone pairs could be incorporated into any number of locations and surfaces in the dashboard or cockpit of a vehicle, or in the housing of a smartphone or digital home assistant device. Furthermore, as described in ¶0051 herein and in the original specification, τ sample shifts can be used to steer the “look” direction of the microphone array. Hence, any number of microphone orientations, relative to the location of the target talker, can be used for an HCI application embodiment of the invention. For example, the alternative processing for the third embodiment of the STTC ALD, described in paragraphs ¶0083-0093 and illustrated in FIGS. 15 and 16, could be adapted for use in an HCI application, with the microphones in an “endfire” array configuration relative to the target talker, and the STTC processing steered 90° towards the target talker (or towards any designated “look” direction) by τ sample shifts; see ¶0051 herein and in the original specification.

Embodiment in a 2-Microphone Binaural Hearing Aid.

Although the devices described thus far have leveraged multiple microphone pairs to compute an effective time-varying filter that can suppress non-stationary sound sources, the approach could also be used in binaural hearing aids using only two near-ear microphones [44], as shown in FIG. 4. While the overall performance would not be comparable to that of the six-microphone implementation, a two-microphone implementation would indeed still provide a speech intelligibility benefit, albeit only for a “straight ahead” look direction of 0°; i.e., the “look” direction would not be steerable. Because much of the processing is the same as or similar to that of the 6-microphone assistive listening device described earlier, the description below is limited to highlighting the key differences when using only one pair of binaural in-ear microphones.

FIG. 24 is a block diagram of minimalist STTC processing for a single pair of binaural in-ear (or near-ear) microphones [44]. It includes the following distinct stages of calculations, similar to the processing of FIG. 6 except for the use of only one, rather than three, microphone pairs:

1. Short-Time Fourier Transform (STFT) processing [97], which converts each microphone signal into a frequency-domain signal.

2. Ratio Mask (RM) processing [98], applied to the frequency-domain signals of the microphone pair.

3. Output signal processing [99], which uses the ratio mask RM to scale/modify the binaural input signals to serve as binaural output signal(s) [16].

The STTC processing [98] would use only the signals from the binaural microphones, the Left and Right STFTs X_(L)[n, k] and X_(R)[n, k] [24], to compute a Ratio Mask (RM):

$\begin{matrix}{{R{M\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{R}\left\lbrack {n,k} \right\rbrack}} - {{{X_{L}\left\lbrack {n,k} \right\rbrack} - {X_{R}\left\lbrack {n,k} \right\rbrack}}}}{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{R}\left\lbrack {n,k} \right\rbrack}}}} & (49)\end{matrix}$

If there is only one pair of microphones, and therefore only one Ratio Mask (RM) is computed, then the Global Ratio Mask (RM_(G)) and the single Ratio Mask (RM) are equivalent; i.e., RM_(G)[n, k]=RM[n, k].

For the output signal reconstruction [99], the RM_(G)[n, k] T-F mask (i.e., time-varying filter) can be used to filter the signals from the Left and Right near-ear microphones [44]:

Y_(L)[n,k]=RM_(G)[n,k]×X_(L)[n,k],  Y_(R)[n,k]=RM_(G)[n,k]×X_(R)[n,k]  (50)

Synthesis of a stereo output (y_(L)[m] and y_(R)[m]) estimate of the target speech signal consists of taking the Inverse Short Time Fourier Transforms (ISTFTs) of Y_(L)[n, k] and Y_(R)[n, k] and using the overlap-add method of reconstruction. The minimalist processing described here would provide a speech intelligibility benefit, for a target talker “straight ahead” at 0°, while still preserving binaural cues. Alternative processing might include using a Thresholded Ratio Mask (RM_(T)), as described in the previous sections, for computing the outputs Y_(L) and Y_(R).

A Binary Mask BM[n, k] may also be computed using a thresholding function, with threshold value ψ, which may be set to a fixed value of ψ=0.2 for example:

$\begin{matrix}{{B{M\left\lbrack {n,k} \right\rbrack}} = \left\{ \begin{matrix}{1\ } & {{{if}\ R{M\left\lbrack {n,k} \right\rbrack}} \geq \psi} \\{0\ } & {{{if}\ R{M\left\lbrack {n,k} \right\rbrack}} < \psi}\end{matrix} \right.} & (51)\end{matrix}$

When using only one pair of microphones, the Thresholded Ratio Mask (RM_(T)) is the product of the Ratio Mask and Binary Mask:

RM_(T)[n,k]=RM[n,k]×BM[n,k]  (52)

For this alternative processing for the output signal reconstruction [99], when using only one pair of microphones, the RM_(T)[n, k] T-F mask (i.e., time-varying filter) can be used to filter the signals from the Left and Right near-ear microphones [44]:

Y_(L)[n,k]=RM_(T)[n,k]×X_(L)[n,k],  Y_(R)[n,k]=RM_(T)[n,k]×X_(R)[n,k]  (53)
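As an illustrative sketch only, the two-microphone chain of equations 49 and 51 through 53 can be written in Python with NumPy as shown below. The function name `binaural_sttc`, the array layout (frames by frequency channels), and the `eps` guard are assumptions of this sketch; the STFT and overlap-add ISTFT stages are taken to be performed elsewhere.

```python
import numpy as np

def binaural_sttc(XL, XR, psi=0.2, eps=1e-12):
    """Sketch of equations 49 and 51-53 for one pair of binaural microphones.

    XL, XR: complex STFTs of shape (frames, bins).
    Returns the filtered STFTs (YL, YR) and the Thresholded Ratio Mask.
    """
    mixture = np.abs(XL) + np.abs(XR)           # pairwise mixture estimate
    noise = np.abs(XL - XR)                     # target-cancelled noise estimate
    rm = (mixture - noise) / (mixture + eps)    # equation 49
    bm = (rm >= psi).astype(float)              # equation 51
    rm_t = rm * bm                              # equation 52
    return rm_t * XL, rm_t * XR, rm_t           # equation 53

# Example: one frame, three frequency channels.
XL = np.array([[1 + 0j, 1 + 0j, 1 + 0j]])
XR = np.array([[1 + 0j, -1 + 0j, np.exp(2j)]])
YL, YR, rm_t = binaural_sttc(XL, XR)
```

Because the same real-valued mask multiplies both ear signals, the interaural level and phase relationships within each retained T-F tile are left unchanged, consistent with the stated goal of preserving binaural cues.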

Alternative Processing and Alternative Embodiments of an STTC Binaural Hearing Aid.

Alternative processing, which now incorporates the Phase Difference Normalization Vector (PDNV) computation described earlier in ¶0065-0068, is illustrated in the following pages and in FIGS. 25-30, which detail variations of an STTC binaural hearing aid.

Alternative Two-Microphone Binaural Processing with Phase Difference Normalization

A pairwise “Left,Right” Ratio Mask RM_(L,R) can also be calculated as follows, using the signals from a “Left, Right” ([L,R]) pair of binaural microphones:

$\begin{matrix}{{R{M_{L,R}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{R}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{L,R}\rbrack}\lbrack k\rbrack}{{{X_{L}\left\lbrack {n,k} \right\rbrack} - {X_{R}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{R}\left\lbrack {n,k} \right\rbrack}}}} & (54)\end{matrix}$

A pairwise Phase Difference Normalization Vector (PDNV) Γ_([L,R])[k], which scales the pairwise Noise (N̂) estimate, can be pre-computed for the [L,R] microphone pair spacing:

$\begin{matrix}{\Gamma_{[L,R]}[k] = \left\{ \begin{matrix}{\left( \left| \angle e^{j 2\pi \omega_{k} \beta_{L,R} \frac{d_{L,R}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq F_{c}\ \text{Hz}} \\ {1,} & {\text{if } \omega_{k} > F_{c}\ \text{Hz}}\end{matrix} \right.} & (55)\end{matrix}$

Here we assume that the target talker is “straight ahead” at 0°; i.e., directly in front of the ALD user. Hence, the “Left,Right” processing does not need to be steered via τ sample shifts and β_(L,R) is given the default unity value (i.e., β_(L,R)=1). Note that in order to compute Γ_([L,R])[k], the distance in meters between the two microphones, d_(L,R), needs to be either known or estimated. Hence, this d_(L,R) value may need to be determined and/or tuned for users, since these are binaural microphones and there is a range of human head widths. As a default value, we can assume that d_(L,R)=150 mm, which is the width of the average human head. Modifications might also have to be made to the computation of Γ_([L,R])[k], shown in equation 55, to account for frequency-dependent ITD, ILD and interaural phase differences caused by head shadowing.

Below a pre-determined frequency cutoff F_(c), the PDNV Γ_([L,R])[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference (see equation 27) at a maximum possible angular separation of θ_(max)=π/2 radians. Although the PDNV Γ[k] can be equivalent to the inverse of this phase difference across all discrete frequencies ω_(k), here Γ[k] is set to unity at and above a pre-determined frequency cutoff. This alternative processing, for two-microphone binaural processing with Phase Difference Normalization, is illustrated in the block diagram in FIG. 25 (compare FIGS. 24 and 25). An optional “Channel Weighting” post-processing step (see ¶0080-0081) smooths RM_(L,R)[n, k] along the frequency axis to yield the Smoothed Ratio Mask RM_(S), which can then be applied to the signals from the Left and Right ears (see FIG. 25).

Alternative Dual-Monaural STTC Processing with Binaural Microphone Pairs

A second embodiment in a binaural hearing aid would use a pair of near-ear microphones in each ear, and would adapt the pairwise processing to compute a Ratio Mask independently for the Left and Right ears, respectively. This is illustrated in the block diagram shown in FIG. 26, and for the ALD shown in FIG. 27.

FIG. 27 shows an example physical realization of an assistive listening device or ALD, specifically as a set of microphones and loudspeakers worn by a user. In this realization, the microphones are two pairs of two near-ear microphones [44-R], [44-L]. The near-ear microphones [44] are included in respective right and left earbuds [46-R], [46-L] along with corresponding in-ear loudspeakers [48-R], [48-L]. This is comparable to the binaural (i.e., left and right) earbuds described herein, and in the original specification, in ¶0021 and FIG. 4, albeit with a pair of in-ear microphones in both the Left ([L, L2]) and Right ([R, R2]) earbuds, respectively.

As described in ¶0051 herein and in the original specification, τ sample shifts can be used to steer the “look” direction of the microphone array. As shown on the far Left side of FIG. 26, τ sample shifts delay the signals from the anterior L and R microphones (see FIG. 27), relative to the posterior L2 and R2 microphones, before Time-Frequency analysis, so as to steer the “look” direction by 90°, towards a target talker in front of the ALD user. The τ sample shifts are computed for a given microphone spacing where F_(s) is the sampling rate, d_(L) and d_(R) are the inter-microphone spacing in meters for the Left ([L, L2]) and Right ([R, R2]) side microphone pairs, λ is the speed of sound in meters per second and θ is the specified angular “look” direction in radians:

$\begin{matrix}{\tau_{L} = \left\lfloor F_{s} \times \frac{d_{L}}{\lambda} \sin(\theta) \right\rceil, \quad \tau_{R} = \left\lfloor F_{s} \times \frac{d_{R}}{\lambda} \sin(\theta) \right\rceil} & (56)\end{matrix}$

Values of θ=π/2 and d=10 mm (i.e., d_(L)=10 mm and d_(R)=10 mm) are used for the processing and array configuration illustrated in FIGS. 26 and 27. Because the “look” direction is steered 90° (i.e., θ=π/2 radians), a value of β=2 is used for the scaling parameters β_(L) and β_(R) (i.e., β_(L)=2 and β_(R)=2) that are used to compute the Γ_(L)[k] and Γ_(R)[k] Phase Difference Normalization Vectors (PDNV) for the Left ([L,L2]) and Right ([R,R2]) microphone pairs, respectively.
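The τ sample shift of equation 56 can be sketched as below. The sampling rate of 44.1 kHz and the value of 343 m/s for λ in the example are assumptions of this sketch, as are the function names `tau_samples` and `delay`; only non-negative shifts are handled, and the shift is applied to the anterior microphone signal as described above.

```python
import numpy as np

def tau_samples(fs_hz, d_m, theta_rad, c=343.0):
    """Sketch of equation 56: steering delay rounded to the nearest whole sample."""
    return int(np.rint(fs_hz * (d_m / c) * np.sin(theta_rad)))

def delay(x, tau):
    """Delay a time-domain signal by tau >= 0 samples, zero-padding the start."""
    tau = min(max(tau, 0), len(x))
    return np.concatenate([np.zeros(tau, dtype=x.dtype), x[:len(x) - tau]])

# Example: 10 mm spacing steered by 90 degrees (assumed 44.1 kHz sampling rate).
tau_L = tau_samples(fs_hz=44100.0, d_m=0.010, theta_rad=np.pi / 2)
x_anterior = np.array([1.0, 2.0, 3.0, 4.0])
x_steered = delay(x_anterior, tau_L)
```

With such a narrow spacing the rounded shift is only a sample or so at common audio sampling rates, so the achievable steering resolution in this sketch depends strongly on the sampling rate chosen.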

Pairwise Left and Right ratio masks, RM_(L) and RM_(R), can be calculated as follows; i.e., there is a unique RM for the respective Left and Right microphone pairs ([L, L2], [R, R2]):

$\begin{matrix}{{R{M_{L}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{L2}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{L}\lbrack k\rbrack}{{{X_{L}\left\lbrack {n,k} \right\rbrack} - {X_{L2}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{L2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {57a} \right) \\{{R{M_{R}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{R}\left\lbrack {n,k} \right\rbrack}} + {{X_{R2}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{R}\lbrack k\rbrack}{{{X_{R}\left\lbrack {n,k} \right\rbrack} - {X_{R2}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{R}\left\lbrack {n,k} \right\rbrack}} + {{X_{R2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {57b} \right)\end{matrix}$

Left and Right side pairwise Phase Difference Normalization Vectors (PDNV) Γ_(L)[k] and Γ_(R)[k], which scale the respective pairwise Noise (N̂) estimates in equation 57, can be pre-computed for the d_(L) and d_(R) microphone pair spacings, which are 10 mm in the example illustrated in FIGS. 26 and 27:

$\begin{matrix}{{\Gamma_{L}\lbrack k\rbrack} = \left\{ \begin{matrix}{\frac{1}{_{L}\lbrack k\rbrack} = \left( {{\angle e}^{j\; 2{\pi\omega}_{k}\beta_{L}\frac{d_{L}}{\lambda}\sin \; {(\theta_{\max})}}} \right)^{- 1}} & {{{if}\ \omega_{k}} \leq {F_{c}\ {Hz}}} \\{1,} & {{{if}\ \omega_{k}} > {F_{c}\ {Hz}}}\end{matrix} \right.} & \left( {58a} \right) \\{{\Gamma_{R}\lbrack k\rbrack} = \left\{ \begin{matrix}{\frac{1}{_{R}\lbrack k\rbrack} = \left( {{\angle e}^{j\; 2\pi \mspace{2mu} \omega_{k}\beta_{R}\frac{d_{R}}{\lambda}\sin \; {(\theta_{\max})}}} \right)^{- 1}} & {{{if}\ \omega_{k}} \leq {F_{c}\ {Hz}}} \\{1,} & {{{if}\ \omega_{k}} > {F_{c}\ {Hz}}}\end{matrix} \right.} & \left( {58b} \right)\end{matrix}$

Below a pre-determined frequency cutoff F_(c), the pairwise PDNV Γ[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference (see equation 58) at a maximum possible angular separation of θ_(max)=π/2 radians. Although the pairwise PDNV Γ[k] can be equivalent to the inverse of this phase difference across all discrete frequencies ω_(k), here Γ[k] is set to unity at and above a pre-determined frequency cutoff. This alternative processing is illustrated in the block diagram in FIG. 26. An optional “Channel Weighting” post-processing step (see ¶0080-0081) smooths RM_(L,R)[n, k] along the frequency axis to yield the Smoothed Ratio Mask RM_(S), which can then be applied to the signals from the Left and Right ears (see FIG. 26).

Alternative STTC Binaural Hearing Aid with Phase Difference Normalization

A third example embodiment of a binaural hearing aid with STTC processing combines the first and second embodiments, with both binaural and “dual monaural” processing. The “piecewise construction” approach, described herein and in the original specification, is used to compute a Global Ratio Mask RM_(G) from pairwise Ratio Masks (RM) computed with varied microphone spacings. This third example embodiment uses both a 150 mm spacing ([L, R]) and a 10 mm spacing ([L, L2] and [R, R2]), as illustrated in FIG. 27.

Absolute phase differences for the two microphone spacings (150 and 10 mm) and three Direction of Arrival (DOA) angles (±30°, ±60°, ±90°) are plotted in the top row of FIG. 28. There is an interaction between frequency, microphone spacing and Direction of Arrival angle (θ) that yields wrapped [−π, π] absolute phase differences of zero at specific frequencies. Where the phase difference is at or near zero, the target cancellation approach is ineffective, as the interfering sound sources are cancelled at these frequencies and thereby are erroneously included in the frequency-domain signal estimate (Ŝ=M̂−N̂). Multiple microphone pairs are used to overcome this null phase difference problem and thereby improve performance.

One disadvantage of using narrowly spaced microphones is that there isn't much phase difference for the STTC processing to work with, especially at low frequencies. Hence the approach taken with this third embodiment is to use the wider spacing of the binaural ([L,R]) microphone pair for the lower frequencies (<2 kHz), and to use the more narrowly spaced “dual monaural” ([L, L2] and [R, R2]) microphone pairs for the ≈2-3 kHz frequency range(s) where the binaural microphone pair suffers from null phase differences; this “piecewise construction” approach is illustrated in the bottom row of FIG. 28.

Block diagrams for this third example embodiment, of a binaural hearing aid with STTC processing, are shown in FIGS. 29 and 30; compare with the first “binaural” embodiment (FIGS. 24 and 25) and the second “dual monaural” embodiment (FIG. 26) and note that this third embodiment effectively combines the processing described for the first two embodiments, albeit with the “piecewise construction” approach described in the original specification.

As described in ¶0051 herein and in the original specification, τ sample shifts can be used to steer the “look” direction of the microphone array. Here we assume that the target talker is “straight ahead” at 0°; i.e., directly in front of the ALD user. Hence, the “Left, Right” processing for the binaural microphone pair ([L,R]) does not need to be steered via τ sample shifts and β_(L,R) is given the default unity value (i.e., β_(L,R)=1). However, the “look” directions of the [L, L2] and [R, R2] microphone pairs will be steered 90°; i.e., towards the target talker.

As shown on the far Left side of FIGS. 29 and 30, τ sample shifts delay the signals from the anterior L and R microphones (see FIG. 27), relative to the posterior L2 and R2 microphones, before Time-Frequency analysis, so as to steer the “look” direction by 90°, towards a target talker in front of the ALD user. The τ sample shifts are computed for a given microphone spacing where F_(s) is the sampling rate, d_(L) and d_(R) are the inter-microphone spacing in meters for the Left ([L, L2]) and Right ([R, R2]) side microphone pairs, λ is the speed of sound in meters per second and θ is the specified angular “look” direction in radians:

$\begin{matrix}{\tau_{L} = \left\lfloor F_{s} \times \frac{d_{L}}{\lambda} \sin(\theta) \right\rceil, \quad \tau_{R} = \left\lfloor F_{s} \times \frac{d_{R}}{\lambda} \sin(\theta) \right\rceil} & (59)\end{matrix}$

Values of θ=π/2 and d=10 mm (i.e., d_(L)=10 mm and d_(R)=10 mm) are used for the processing and array configuration illustrated in FIGS. 26 and 27. Because the “look” direction is steered 90° (i.e., θ=π/2 radians), a value of β=2 is used for the scaling parameters β_(L) and β_(R) (i.e., β_(L)=2 and β_(R)=2) used to compute Γ_(L)[k] and Γ_(R)[k] for the Left ([L,L2]) and Right ([R,R2]) microphone pairs, respectively. As illustrated on the left hand side of FIGS. 29 and 30, the τ sample shifts are used to delay x_(L)[m] and x_(R)[m]; although the same X_(L)[n, k] and X_(R)[n, k] notation is used for all three Ratio Mask (RM) computations, X_(L)[n, k] and X_(R)[n, k] are in this case local variables, computed uniquely for each of the three RM computations.

The “piecewise construction” STTC processing for this third embodiment is illustrated in FIGS. 27-30. Each of the three microphone pairs ([L,R], [L,L2], [R,R2]) yields a Ratio Mask (RM_(L,R), RM_(L) and RM_(R)). Here the chimeric Global Ratio Mask RM_(G) has the 0 to 2 kHz and 3 to 4 kHz frequency channels from RM_(L,R) and the 2 to 3 kHz and 4 kHz to F_(s)/2 frequency channels from RM_(L) and RM_(R) (see FIG. 28).

Pairwise ratio masks RM are calculated as follows; i.e., there is a unique RM for each pair of microphones ([L,R], [L,L2], [R,R2]):
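The “piecewise construction” just described can be sketched for one ear as follows. The function name `global_ratio_mask` and the treatment of the band edges (the 2 kHz, 3 kHz and 4 kHz boundaries are assigned to the band that starts at each edge) are assumptions of this sketch; `rm_side` stands for RM_(L) or RM_(R), depending on the ear being processed.

```python
import numpy as np

def global_ratio_mask(rm_lr, rm_side, freqs_hz):
    """Sketch of a piecewise-constructed Global Ratio Mask for one ear.

    rm_lr:   RM_(L,R) from the widely spaced binaural pair, shape (frames, bins).
    rm_side: RM_(L) or RM_(R) from the narrowly spaced pair, same shape.
    freqs_hz: centre frequency of each frequency channel, shape (bins,).
    """
    # Channels taken from the narrowly spaced pair: 2-3 kHz and 4 kHz upward.
    # Assigning each edge to the upper band is an assumption of this sketch.
    narrow = ((freqs_hz >= 2000.0) & (freqs_hz < 3000.0)) | (freqs_hz >= 4000.0)
    return np.where(narrow, rm_side, rm_lr)

# Example: four frequency channels, two analysis frames.
freqs = np.array([500.0, 2500.0, 3500.0, 5000.0])
rm_lr = np.zeros((2, 4))
rm_left = np.ones((2, 4))
rm_g_left = global_ratio_mask(rm_lr, rm_left, freqs)
```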

$\begin{matrix}{{R{M_{L,R}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{R}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{\lbrack{L,R}\rbrack}\lbrack k\rbrack}{{{X_{L}\left\lbrack {n,k} \right\rbrack} - {X_{R}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{R}\left\lbrack {n,k} \right\rbrack}}}} & \left( {60a} \right) \\{{{RM}_{L}\left\lbrack {n,k} \right\rbrack} = \frac{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{L2}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{L}\lbrack k\rbrack}{{{X_{L}\left\lbrack {n,k} \right\rbrack} - {X_{L2}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{L}\left\lbrack {n,k} \right\rbrack}} + {{X_{L2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {60b} \right) \\{{R{M_{R}\left\lbrack {n,k} \right\rbrack}} = \frac{{{X_{R}\left\lbrack {n,k} \right\rbrack}} + {{X_{R2}\left\lbrack {n,k} \right\rbrack}} - {{\Gamma_{R}\lbrack k\rbrack}{{{X_{R}\left\lbrack {n,k} \right\rbrack} - {X_{R2}\left\lbrack {n,k} \right\rbrack}}}}}{{{X_{R}\left\lbrack {n,k} \right\rbrack}} + {{X_{R2}\left\lbrack {n,k} \right\rbrack}}}} & \left( {60c} \right)\end{matrix}$

A pairwise Phase Difference Normalization Vector (PDNV) Γ[k], which scales the respective pairwise Noise (N̂) estimate, can be pre-computed for each microphone pair spacing:

$\begin{matrix}{\Gamma_{[L,R]}[k] = \left\{ \begin{matrix}{\left( \left| \angle e^{j 2\pi \omega_{k} \beta_{L,R} \frac{d_{L,R}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq F_{c_{L,R}}\ \text{Hz}} \\ {1,} & {\text{if } \omega_{k} > F_{c_{L,R}}\ \text{Hz}}\end{matrix} \right.} & (61a) \\ {\Gamma_{L}[k] = \left\{ \begin{matrix}{\left( \left| \angle e^{j 2\pi \omega_{k} \beta_{L} \frac{d_{L}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq F_{c_{L}}\ \text{Hz}} \\ {1,} & {\text{if } \omega_{k} > F_{c_{L}}\ \text{Hz}}\end{matrix} \right.} & (61b) \\ {\Gamma_{R}[k] = \left\{ \begin{matrix}{\left( \left| \angle e^{j 2\pi \omega_{k} \beta_{R} \frac{d_{R}}{\lambda} \sin(\theta_{\max})} \right| \right)^{-1},} & {\text{if } \omega_{k} \leq F_{c_{R}}\ \text{Hz}} \\ {1,} & {\text{if } \omega_{k} > F_{c_{R}}\ \text{Hz}}\end{matrix} \right.} & (61c)\end{matrix}$

Below the pre-determined frequency cutoffs F_(c_(L,R)), F_(c_(L)) and F_(c_(R)), the pairwise PDNV Γ[k] is inversely proportional to the discrete-frequency wrapped absolute phase difference (see equation 61) at a maximum possible angular separation of θ_(max)=π/2 radians. Although the pairwise PDNV Γ[k] can be equivalent to the inverse of this phase difference across all discrete frequencies ω_(k), here Γ[k] is set to unity at and above a pre-determined frequency cutoff. This alternative processing, for a binaural hearing aid, is illustrated in the block diagrams in FIGS. 29 and 30.

The block diagrams in FIGS. 29 and 30 illustrate two variations on the processing. In FIG. 29, the “Piecewise Construction” is effected independently for the Left and Right ears, with frequency channels for the Left side RM_(G) chosen from RM_(L) and frequency channels for the Right side RM_(G) chosen from RM_(R). In FIG. 30, only one RM_(G), or a post-processed variant thereof, is computed and applied to the signals at both ears, so as to preserve binaural cues for spatial hearing. The block diagram in FIG. 29 also illustrates an optional post processing “Channel Weighting” (i.e., smoothing along frequency) step, as described in ¶0080-0081.

Yet another variation on the processing described here could use the reconstruction stage described in ¶0081-0082, and illustrated on the right side of FIG. 14, wherein a user-defined “mix” parameter α would allow the user to determine the ratio of processed and unprocessed output. Further variations might allow the user, or an audiologist, to determine the value of certain parameters, for example, the d_(L,R) parameter specifying the distance in meters between the Left and Right in-ear microphones, the β value used to compute the PDNV, or whether to use frequency channels from the widely spaced [L, R] microphones, or from the narrowly spaced ([L, L2] and [R, R2]) microphones, for the 3-4 kHz frequency range (see FIG. 28).

STTC Processing can be Used as a Post-Filter for Fixed and/or Adaptive Beamforming.

Alternative processing could also involve using the Global Ratio MaskRM_(G), or a post-processed variant thereof, as a postfilter for a fixedand/or adaptive beamformer. The beamforming could be implemented usingthe same array of microphones, or a subset thereof, used for the STTCprocessing. This was described in ¶0049-0052 and FIG. 12 of the originalspecification for a simple fixed beamformer, where the T-F mask computedby STTC processing was used as a post-filter for the average of thefrequency domain signals from all microphones in the array. Fixed andadaptive beamforming techniques generally yield a mono output, hencethere is a potential tradeoff here between enhancing speechintelligibility, and/or speech quality, at the expense of the loss ofbinaural cues for spatial hearing. The ideal mix of processed andunprocessed output, and of STTC processing and beamforming, could bedefined by the user, either beforehand or online via a user interface,for example via a smartphone application.
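A minimal sketch of the simple fixed-beamformer case mentioned above, in which the T-F mask post-filters the average of the frequency-domain signals from all microphones, is given below. The function name `sttc_postfilter` and the array layout (microphones by frames by frequency channels) are assumptions of this sketch, and the output is mono, as noted above.

```python
import numpy as np

def sttc_postfilter(X, rm_g):
    """Sketch: apply a Global Ratio Mask as a post-filter to a fixed beamformer.

    X:    complex STFTs of all microphones, shape (mics, frames, bins).
    rm_g: Global Ratio Mask, shape (frames, bins).
    """
    beam = X.mean(axis=0)    # fixed beamformer: average across microphones
    return rm_g * beam       # mono frequency-domain output

# Example: two microphones, one frame, two frequency channels.
X = np.array([[[2 + 0j, 4 + 0j]],
              [[4 + 0j, 0 + 0j]]])
rm_g = np.array([[1.0, 0.5]])
Y = sttc_postfilter(X, rm_g)
```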

As mentioned in ¶0013 herein and in the original specification, anadvantage of the STTC processing described herein, relative to adaptivebeamforming techniques, such as the MWF and MVDR beamformers, whichgenerally have diotic (i.e., mono) outputs, is that the time-varyingfilter computed by the STTC processing is a set of frequency channelweights that can be applied independently to signals at the Left andRight ear, thereby enhancing speech intelligibility for a target talkerwhile still preserving binaural cues for spatial hearing.

When using the STTC T-F mask as a post-filter for fixed and/or adaptivebeamforming, any benefit measured in objective measures of performance(i.e., noise reduction, speech intelligibility, speech quality) may beoffset by the loss of binaural cues for spatial hearing, which areimportant for maintaining a sense of spatial and situational awareness.The user of the assistive listening device, or machine hearing device,can determine for themselves, and for their current listeningenvironment, the ideal combination of STTC processing, fixed and/oradaptive beamforming, and unprocessed output via a user-interface.

STTC Processing can be Used for Online Remote Communication BetweenConversants.

As mentioned in ¶0009 herein and in the original specification, STTCprocessing can be implemented as a computer-integrated front-end forteleconferencing (i.e., remote communication); more generally, the STTCfront-end approach may be used for Human-Computer Interaction (HCI) inenvironments with multiple competing talkers, such as air-trafficcontrol towers, and variations could be integrated into use-environmentstructures such as the cockpit of an airplane. Hence the STTCprocessing, which can enhance speech intelligibility in real-time, couldbe used on both ends of an online remote communication between multiplehuman conversants, for example, between an air-traffic controller and anairplane pilot, both of whom might be in a noisy environment withmultiple stationary and/or non-stationary interfering sound sources.

While various embodiments of the invention have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

What is claimed:
 1. An assistive listening device for use in the presence of stationary interfering sound sources and/or non-stationary interfering sound sources, comprising an array of microphones arranged into a set of microphone pairs positioned about an axis with respective distinct intra-pair microphone spacings, each microphone of the array of microphones generating a respective audio input signal; a pair of ear-worn loudspeakers; and audio circuitry configured to compute a set of time-varying filters, for real-time speech intelligibility enhancement, using causal and memoryless frame-by-frame processing, comprising (1) applying a short-time frequency transform to each of the respective audio input signals, thereby converting the respective time domain signals into respective frequency-domain signals for every short-time analysis frame, (2) calculating a pairwise noise estimate by first subtracting the respective frequency-domain signals from a microphone pair and thereafter taking the magnitude of the difference, (3) calculating a pairwise mixture estimate by first taking the magnitudes of the respective frequency domain signals from a microphone pair, and thereafter adding the respective magnitudes, (4) scaling a pairwise noise estimate by a pre-computed pairwise Phase Difference Normalization Vector (PDNV), which normalizes a pairwise noise estimate, at each discrete frequency, in a manner dependent on the value of the maximum possible phase difference, at each discrete frequency, for a given microphone pair spacing, and (5) calculating a pairwise ratio mask from the pairwise noise estimate and the pairwise mixture estimate for each of the respective microphone pairs, wherein the calculation of a pairwise ratio mask includes the aforementioned frequency-domain subtraction of signals and scaling of a pairwise noise estimate by a pre-computed pairwise PDNV, (6) calculating a global ratio mask, which is an effective time-varying filter with a vector of frequency channel weights for every short-time analysis frame, from the set of pairwise ratio masks, with the frequency channels from each pairwise ratio mask chosen according to the frequency range(s) for which the distinct intra-pair microphone spacing provides a positive absolute phase difference; wherein when using only one pair of microphones, the singular pairwise ratio mask and the global ratio mask are equivalent, and (7) applying the global ratio mask, or a post-processed variant thereof, and inverse short-time frequency transforms, to selected ones of the frequency-domain signals, or to the frequency-domain output of a fixed or adaptive beamformer that operates in parallel using the same array of microphones (or a subset thereof), thereby suppressing both the stationary and the non-stationary interfering sound sources in real-time and generating an audio output signal for driving the loudspeakers.
 2. The assistive listening device of claim 1, wherein the array of microphones includes a set of one or more pairs of microphones with predetermined intra-pair microphone spacings.
 3. The assistive listening device of claim 1, wherein the array of microphones is arranged on a head-worn frame worn by a user.
 4. The assistive listening device of claim 3, wherein the head-worn frame is an eyeglass frame.
 5. The assistive listening device of claim 4, wherein the array of microphones is arranged across a front of the eyeglass frame.
 6. The assistive listening device of claim 4, wherein the array of microphones includes microphones arranged on at least one of the temple pieces (i.e., stems) of the eyeglass frame.
 7. The assistive listening device of claim 1, wherein the array of microphones includes in-ear or near-ear microphones whose corresponding frequency-domain signals are the selected frequency-domain signals to which the global ratio mask, or a post-processed variant thereof, and inverse short-time frequency transforms are applied.
 8. The assistive listening device of claim 1, wherein the processed and unprocessed frequency-domain signals are combined before applying inverse short-time frequency transforms, and a user of the device determines the mixture of processed and unprocessed output, either beforehand or online via a user-interface.
 9. A machine hearing device for generating speech signals to be used in identifying semantic content in the presence of stationary interfering sound sources and/or non-stationary interfering sound sources, and thereby allowing for remote communication and/or the performance of automated actions by related systems in response to the identified semantic content, the hearing device comprising: a set of microphones generating respective audio input signals arranged in an array having a set of microphone pairs arranged about an axis with pre-determined intra-pair microphone spacings; and audio circuitry configured to compute a set of time-varying filters, for real-time speech intelligibility enhancement, using causal and memoryless frame-by-frame processing, comprising (1) applying a short-time frequency transform to each of the respective audio input signals, thereby converting the respective time domain signals into respective frequency-domain signals for every short-time analysis frame, (2) calculating a pairwise noise estimate by first subtracting the respective frequency-domain signals from a microphone pair and thereafter taking the magnitude of the difference, (3) calculating a pairwise mixture estimate by first taking the magnitudes of the respective frequency domain signals from a microphone pair, and thereafter adding the respective magnitudes, (4) scaling a pairwise noise estimate by a pre-computed pairwise Phase Difference Normalization Vector (PDNV), which normalizes a pairwise noise estimate, at each discrete frequency, in a manner dependent on the value of the maximum possible phase difference, at each discrete frequency, for a given microphone pair spacing, and (5) calculating a pairwise ratio mask from the pairwise noise estimate and the pairwise mixture estimate for each of the respective microphone pairs, wherein the calculation of a pairwise ratio mask includes the aforementioned frequency-domain subtraction of signals and scaling of a pairwise noise estimate by a pre-computed pairwise PDNV, (6) calculating a global ratio mask, which is an effective time-varying filter with a vector of frequency channel weights for every short-time analysis frame, from the set of pairwise ratio masks, with the frequency channels from each pairwise ratio mask chosen according to the frequency range(s) for which the distinct intra-pair microphone spacing provides a positive absolute phase difference; wherein when using only one pair of microphones, the singular pairwise ratio mask and the global ratio mask are equivalent, and (7) applying the global ratio mask, or a post-processed variant thereof, and inverse short-time frequency transforms, to selected ones of the frequency-domain signals, or to the frequency-domain output of a fixed or adaptive beamformer that operates in parallel using the same array of microphones (or a subset thereof), thereby suppressing both the stationary and the non-stationary interfering sound sources in real-time and allowing for identification of the target speech signal.
 10. The machine hearing device of claim 9, wherein the array of microphones includes a set of one or more pairs of microphones with predetermined intra-pair microphone spacings.
 11. The machine hearing device of claim 9, wherein the array of microphones is arranged along a border of a display that can be positioned in front of a user.
 12. The machine hearing device of claim 9, wherein the array of microphones is integrated into the housing of a digital device that responds to voice commands.
 13. The machine hearing device of claim 9, wherein the array of microphones is integrated into the housing of a portable digital device.
 14. The machine hearing device of claim 9, wherein the hardware configuration is adapted for remote communication in one or more noisy listening environments.
 15. The machine hearing device of claim 9, wherein the hardware configuration is adapted for remote communication between two or more human conversants.
 16. The machine hearing device of claim 9, wherein the array of microphones is integrated into a use-environment structure.
 17. The machine hearing device of claim 16, wherein the use-environment structure is the cabin or cockpit of a vehicle.
 18. An assistive listening device for use in the presence of stationary interfering sound sources and/or non-stationary interfering sound sources, comprising one or more pairs of in-ear or near-ear microphones, each microphone generating a respective audio input signal; a pair of ear-worn loudspeakers; and audio circuitry configured to compute a time-varying filter, for real-time speech intelligibility enhancement, using causal and memoryless frame-by-frame processing, comprising (1) applying a short-time frequency transform to each of the respective audio input signals, thereby converting the respective time domain signals into respective frequency-domain signals for every short-time analysis frame, (2) calculating a pairwise noise estimate by first subtracting the respective frequency-domain signals from a microphone pair and thereafter taking the magnitude of the difference, (3) calculating a pairwise mixture estimate by first taking the magnitudes of the respective frequency-domain signals from a microphone pair, and thereafter adding the respective magnitudes, (4) scaling a pairwise noise estimate by a pre-computed pairwise Phase Difference Normalization Vector (PDNV), which normalizes a pairwise noise estimate, at each discrete frequency, in a manner dependent on the value of the maximum possible phase difference, at each discrete frequency, for a given microphone pair spacing, and (5) calculating a pairwise ratio mask from the pairwise noise estimate and the pairwise mixture estimate for each of the respective microphone pairs, wherein the calculation of a pairwise ratio mask includes the aforementioned frequency-domain subtraction of signals and scaling of a pairwise noise estimate by a pre-computed pairwise PDNV, (6) calculating a global ratio mask, which is an effective time-varying filter with a vector of frequency channel weights for every short-time analysis frame, from the set of pairwise ratio masks, with the frequency channels from each pairwise ratio mask chosen according to the frequency range(s) for which the distinct intra-pair microphone spacing provides a positive absolute phase difference; wherein when using only one pair of microphones, the singular pairwise ratio mask and the global ratio mask are equivalent, and (7) applying the global ratio mask, or a post-processed variant thereof, and inverse short-time frequency transforms, to the frequency-domain signals from the in-ear or near-ear microphones, or to the frequency-domain output of a fixed or adaptive beamformer that operates in parallel using the same array of microphones (or a subset thereof), thereby suppressing both the stationary and the non-stationary interfering sound sources in real-time and generating an audio output signal for driving the loudspeakers.
 19. The assistive listening device of claim 18, wherein values of a set of processing parameters can be specified and/or tuned by an audiologist, and/or by the user of the device, either beforehand or online via a user interface.
 20. The assistive listening device of claim 18, wherein the processed and unprocessed frequency-domain signals are combined before applying inverse short-time frequency transforms, and a user of the device determines the mixture of processed and unprocessed output, either beforehand or online via a user interface.
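
For illustration only, steps (2) through (5) recited in the independent claims can be sketched in Python for a single microphone pair and a single analysis frame. Several details below are assumptions that are not stated in the claims: the speed of sound (343 m/s), the maximum possible phase difference taken as 2πfd/c and capped at π, the PDNV taken as the reciprocal of the sine of half that phase difference, and the ratio mask formed as one minus the clipped ratio of the scaled noise estimate to the mixture estimate. The function names `pdnv` and `pairwise_ratio_mask` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value


def pdnv(freqs_hz, spacing_m, c=SPEED_OF_SOUND):
    """Assumed normalization vector: 1 / sin(max_phase_difference / 2)."""
    weights = []
    for f in freqs_hz:
        # Largest phase difference a source can produce across this pair,
        # capped at pi (assumption).
        max_phase = min(2.0 * math.pi * f * spacing_m / c, math.pi)
        s = math.sin(max_phase / 2.0)
        # Where no phase difference is possible (e.g. 0 Hz) use a weight of 0,
        # which leaves that bin unattenuated (assumption).
        weights.append(1.0 / s if s > 1e-6 else 0.0)
    return weights


def pairwise_ratio_mask(x1, x2, norm, eps=1e-12):
    """Ratio mask for one microphone pair and one analysis frame.

    x1, x2 -- complex frequency-domain frames from the two microphones
    norm   -- pre-computed normalization vector for this pair's spacing
    """
    mask = []
    for a, b, g in zip(x1, x2, norm):
        noise = abs(a - b)          # step (2): subtract, then take magnitude
        mixture = abs(a) + abs(b)   # step (3): take magnitudes, then add
        scaled = g * noise          # step (4): scale noise estimate by PDNV
        ratio = scaled / (mixture + eps)
        # Step (5), assumed form: high where the pair signals cancel.
        mask.append(1.0 - min(max(ratio, 0.0), 1.0))
    return mask
```

Under these assumptions, a source on the listening axis reaches both microphones with identical phase, so the subtraction cancels it and the mask stays at 1. A source that produces the maximum possible phase difference drives the mask toward 0 at every frequency below the cap.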